Prediction on Diabetes Patient's Hospital Readmission¶

  • Problem Statement and Objective

A hospital readmission is when a patient who is discharged from the hospital, gets re-admitted again within a certain period of time. Hospital readmission rates for certain conditions are now considered an indicator of hospital quality, and also affect the cost of care adversely. For this reason, Centers for Medicare & Medicaid Services established the Hospital Readmissions Reduction Program which aims to improve quality of care for patients and reduce health care spending by applying payment penalties to hospitals that have more than expected readmission rates for certain conditions. Although diabetes is not yet included in the penalty measures, the program is regularly adding new disease conditions to the list, now totaling 6 for FY2018. In 2011, American hospitals spent over $41 billion on diabetic patients who got readmitted within 30 days of discharge. Being able to determine factors that lead to higher readmission in such patients, and correspondingly being able to predict which patients will get readmitted can help hospitals save millions of dollars while improving quality of care. So, with that background in mind, we used a medical claims dataset (description below), to answer these questions:

Problem statment¶

The goal is to predict the chances of readmission of patients to the hospital considering certain changes in medication procedures, length of stay at the hospital, HbA1c measurement, age group, and multiple factors.

Step 1 - Data Wrangling¶

(a) - Gathering data¶

Data Sets:

• The project utilizes the "Diabetes 130-US hospitals for years 1999-2008 Data Set" from the UCI Machine Learning Repository. This dataset contains a wealth of information about diabetic patients, including demographics, diagnoses, medications, and outcomes, making it well-suited for predictive modeling.

Data Sources:

• The dataset is sourced from the UCI Machine Learning Repository, a reputable source of publicly available datasets for research and educational purposes. Access to this dataset ensures compliance with data privacy regulations and ethical considerations, providing a reliable foundation for the project's analysis and modeling efforts.

Import Libraries¶

In [7]:
pip install plotly --upgrade
Requirement already satisfied: plotly in c:\users\mahen\anaconda3\anaconda\lib\site-packages (5.23.0)
Requirement already satisfied: tenacity>=6.2.0 in c:\users\mahen\anaconda3\anaconda\lib\site-packages (from plotly) (8.2.2)
Requirement already satisfied: packaging in c:\users\mahen\anaconda3\anaconda\lib\site-packages (from plotly) (23.2)
Note: you may need to restart the kernel to use updated packages.
In [8]:
import pandas as pd
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [9]:
import warnings
warnings.filterwarnings('ignore')
In [10]:
df = pd.read_csv("diabetic_data.csv")
df.head()    # Have a look to the data
Out[10]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital ... citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
0 2278392 8222157 Caucasian Female [0-10) ? 6 25 1 1 ... No No No No No No No No No NO
1 149190 55629189 Caucasian Female [10-20) ? 1 1 7 3 ... No Up No No No No No Ch Yes >30
2 64410 86047875 AfricanAmerican Female [20-30) ? 1 1 7 2 ... No No No No No No No No Yes NO
3 500364 82442376 Caucasian Male [30-40) ? 1 1 7 2 ... No Up No No No No No Ch Yes NO
4 16680 42519267 Caucasian Male [40-50) ? 1 1 7 1 ... No Steady No No No No No Ch Yes NO

5 rows × 50 columns

In [11]:
df.columns
Out[11]:
Index(['encounter_id', 'patient_nbr', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')
In [12]:
df.shape
Out[12]:
(101766, 50)

(b) - Assessing Data¶

summary of data¶

  • Introduction

The Diabetes 130-US Hospitals Dataset is a comprehensive dataset sourced from the UCI Machine Learning Repository, encompassing data collected over a decade (1999-2008) from 130 hospitals across the United States. The dataset contains valuable information on diabetic patients' healthcare, including demographics, diagnoses, treatments, and outcomes. With 101,766 patient records and 50 attributes, this dataset is an extensive resource for predictive modeling, particularly in predicting hospital readmissions among diabetic patients.

  • Dataset Overview

The dataset consists of 101,766 rows (patient encounters) and 50 columns (features). Each row represents a unique patient encounter at a hospital, and the columns provide detailed information about the patient's demographics, medical conditions, hospital procedures, medications, and outcomes. The dataset is particularly rich in information that can be leveraged to predict whether a patient is likely to be readmitted to the hospital after discharge.

  • Key Attributes:
  1. Patient Demographics: Includes attributes like race, gender, age, and weight.

  2. Hospital Admission Details: Captured through attributes such as admission_type_id, discharge_disposition_id, and admission_source_id.

  3. Clinical Measurements: Features like time_in_hospital, num_lab_procedures, num_procedures, and num_medications reflect the intensity of care.

  4. Medical Diagnoses: diag_1, diag_2, and diag_3 provide diagnostic codes for primary, secondary, and tertiary diagnoses.

  5. Medications: Information on specific diabetes medications administered, such as metformin, insulin, and others.

  6. Outcome: The readmitted column indicates whether the patient was readmitted to the hospital, which serves as the target variable for prediction.

Column descriptions¶

  • Encounter ID - Unique identifier of an encounter

  • Patient number -Unique identifier of a patient

  • Race Values - Caucasian,Asian,African,Hispanic and other

  • Gender Values - Male,female and unknown/invalid

  • Age - Grouped in 10-year interval: 0,10), 10,20),......,90,100)

  • Weight - weight in pound

  • Admission - Type integer identifier corresponding to 9 distinct values,for example,emergency,urgent,elective,newborn and not available

  • Discharge disposition - Integer identifier corresponding to 29 distinct values. for example ,discharge to home,expired,and not availabe

  • Admission source - Integer identifier corresponding to 21 distint values.for example, physician referral,emergency room and transfer from a hospital

  • Time in Hospital - Integer number of days between admission and discharge

  • Payer code - Integer identifier corresponding to 23 distint values,for example,Blue Cross/Blue Shield,Medicare and self-pay medical

  • Medical specialty - Integer identifier of a specialty of the admitting physician,corresponding to 84 distint values. for example cardiology,internal medicine,family/general practice and surgeon

  • Number of lab procedures - Number of lab tests performed during the encounter

  • Number of procedures - Numeric number of procedures(other than lab tests) performed during the encounter

  • Number of Medications - Numeric of distinct generic names administered during the encounter

  • Number of outpatient - visit number of outpatient visit of the patient in the year preceding the encounter

  • Number of Emergency - visit number of Emergency visit of the patient in the year preceding the encounter

  • Number of inpatient - visit number of inpatient visit of the patient in the year preceding the encounter

  • Diagnosis 1 - The primary diagnosis(coded as first three digits of ICD9): 848 Distinct values

  • Diagnosis 2 - The Secondary diagnosis(coded as first three digits of ICD9): 923 Distinct values

  • Diagnosis 3 - Additional secondary diagnosis(coded as first three digits of ICD9): 954 Distinct values

  • Number of diagnoses - Number of diagnoses entered to the system 0%

  • Glucose serum test result - Indicates the range of the result or if the test was not taken.values:">200",">300","normal" and "none"if not measured

  • A1c test result - indicates the range of the result or if the test was not taken.Values: ">8" if the result was greater than 8%,">7" if the result was greater than 7% but less than 8%,"normal" if the result was less than 7%,and "none" if not measures.

  • Change of medications - indicates if there was a change in diabetic medication (either doses or generic name),Values:"change" and "no change"

  • Diabetes medication - Indicates if there was any diabetic medication prescribed.values: "yes" and "no"

  • 24 features - For medications for the generic name: metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin, glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone and metformin-pioglitazone.The feature indicates whether the drug was prescribed or there was a change in the dosage. values "up" if the dosage was increased during the encounter,"down" if the dosage was decreased,"steady" if the dosage did not change, and "no" if the drug was not prescribed

  • Readmitted - Days to inpatient readmission,values: "<30" if the patient was readmitted in less than 30 days,">30" if the patient was readmitted in more than 30 days, and "No" for no record of readmission

Variables¶

  • Patient identifiers - encounter_id,patient_nbr
  • Patient demographics - race,gender,age,weight,payer_code
  • Admission and Discharge details - admission_source_id,admission_type_id,discharge_disposition_id
  • Patient medical history - number_outpatient,number_inpatient,number_emergency
  • Patient Admission details - medical_speacialty,diag_1,diag_2,diag_3,time_in_hospital,number_diagnoses,num_lab_procedures,num_procedures,num_medications
  • Clinical Result - max_glu_serum,A1cresult
  • Medication Details - diabetesMed,change,23 feature for medications
  • Readmission indicator - readmitted

ICD¶

  • ICD-9 stands for the International Classification of Diseases, 9th Revision. It's a coding system used to classify diseases and other health problems for the purpose of billing, statistical analysis, and medical research. The ICD-9 system was developed by the World Health Organization (WHO) and it was used widely until it was replaced by ICD-10.

Max_Glu_Serum (Maximum Glucose Serum):¶

  • This test measures the concentration of glucose (sugar) in the blood serum. It's commonly used to diagnose and monitor diabetes mellitus. Elevated levels of glucose in the blood can indicate diabetes or other metabolic disorders. The test is typically done after fasting for a certain period (Fasting Blood Glucose Test) or randomly (Random Blood Glucose Test).

A1C Result (Hemoglobin A1C):¶

  • This test measures the average blood glucose levels over the past two to three months. It provides an indication of long-term blood sugar control. The A1C test measures the percentage of hemoglobin (a protein in red blood cells) that has glucose attached to it. It's commonly used to monitor the effectiveness of diabetes treatment and to assess the risk of diabetes-related complications. A lower A1C level indicates better blood sugar control.

  • Both of these tests are important tools in the diagnosis and management of diabetes mellitus. They help healthcare providers assess blood sugar levels over different time frames and make informed decisions about treatment and lifestyle modifications.

In [21]:
df.tail()
Out[21]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital ... citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
101761 443847548 100162476 AfricanAmerican Male [70-80) ? 1 3 7 3 ... No Down No No No No No Ch Yes >30
101762 443847782 74694222 AfricanAmerican Female [80-90) ? 1 4 5 5 ... No Steady No No No No No No Yes NO
101763 443854148 41088789 Caucasian Male [70-80) ? 1 1 7 1 ... No Down No No No No No Ch Yes NO
101764 443857166 31693671 Caucasian Female [80-90) ? 2 3 7 10 ... No Up No No No No No Ch Yes NO
101765 443867222 175429310 Caucasian Male [70-80) ? 1 1 7 6 ... No No No No No No No No No NO

5 rows × 50 columns

In [22]:
df.sample(5)
Out[22]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital ... citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
20228 71473086 23488398 Caucasian Male [50-60) ? 5 1 17 1 ... No Steady No No No No No No Yes NO
15448 59283900 16150374 Caucasian Male [40-50) ? 3 18 1 2 ... No Steady No No No No No Ch Yes NO
5848 29848692 3031074 Caucasian Female [60-70) ? 2 1 1 11 ... No No No No No No No Ch Yes NO
88891 286527180 38637351 AfricanAmerican Female [70-80) ? 3 1 1 1 ... No No No No No No No No No NO
29594 96544062 61080255 Caucasian Male [70-80) ? 1 1 7 4 ... No Steady No No No No No No Yes <30

5 rows × 50 columns

In [23]:
# Checking Dimension of the Data
df.shape
Out[23]:
(101766, 50)
In [24]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_medications           101766 non-null  int64 
 15  number_outpatient         101766 non-null  int64 
 16  number_emergency          101766 non-null  int64 
 17  number_inpatient          101766 non-null  int64 
 18  diag_1                    101766 non-null  object
 19  diag_2                    101766 non-null  object
 20  diag_3                    101766 non-null  object
 21  number_diagnoses          101766 non-null  int64 
 22  max_glu_serum             5346 non-null    object
 23  A1Cresult                 17018 non-null   object
 24  metformin                 101766 non-null  object
 25  repaglinide               101766 non-null  object
 26  nateglinide               101766 non-null  object
 27  chlorpropamide            101766 non-null  object
 28  glimepiride               101766 non-null  object
 29  acetohexamide             101766 non-null  object
 30  glipizide                 101766 non-null  object
 31  glyburide                 101766 non-null  object
 32  tolbutamide               101766 non-null  object
 33  pioglitazone              101766 non-null  object
 34  rosiglitazone             101766 non-null  object
 35  acarbose                  101766 non-null  object
 36  miglitol                  101766 non-null  object
 37  troglitazone              101766 non-null  object
 38  tolazamide                101766 non-null  object
 39  examide                   101766 non-null  object
 40  citoglipton               101766 non-null  object
 41  insulin                   101766 non-null  object
 42  glyburide-metformin       101766 non-null  object
 43  glipizide-metformin       101766 non-null  object
 44  glimepiride-pioglitazone  101766 non-null  object
 45  metformin-rosiglitazone   101766 non-null  object
 46  metformin-pioglitazone    101766 non-null  object
 47  change                    101766 non-null  object
 48  diabetesMed               101766 non-null  object
 49  readmitted                101766 non-null  object
dtypes: int64(13), object(37)
memory usage: 38.8+ MB
  • There are 101766 samples present in this dataset
In [26]:
print("According to the data, there are a total of 50 columns")

# Select categorical columns
df_cat = df.select_dtypes(include='object')

# Select numerical columns
df_num = df.select_dtypes(include=np.number)

# Print the number and names of categorical columns
print('Total categorical columns: ', len(df_cat.columns))
print('\nCategorical column names: ', df_cat.columns.tolist())
print('\n')

# Print the number and names of numerical columns
print('Total numerical columns: ', len(df_num.columns))
print('\nNumerical column names: ', df_num.columns.tolist())
According to the data, there are a total of 50 columns
Total categorical columns:  37

Categorical column names:  ['race', 'gender', 'age', 'weight', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', 'diag_3', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted']


Total numerical columns:  13

Numerical column names:  ['encounter_id', 'patient_nbr', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses']
In [27]:
# Checking whether our data contains any missing value or not
df.isnull().sum()
Out[27]:
encounter_id                    0
patient_nbr                     0
race                            0
gender                          0
age                             0
weight                          0
admission_type_id               0
discharge_disposition_id        0
admission_source_id             0
time_in_hospital                0
payer_code                      0
medical_specialty               0
num_lab_procedures              0
num_procedures                  0
num_medications                 0
number_outpatient               0
number_emergency                0
number_inpatient                0
diag_1                          0
diag_2                          0
diag_3                          0
number_diagnoses                0
max_glu_serum               96420
A1Cresult                   84748
metformin                       0
repaglinide                     0
nateglinide                     0
chlorpropamide                  0
glimepiride                     0
acetohexamide                   0
glipizide                       0
glyburide                       0
tolbutamide                     0
pioglitazone                    0
rosiglitazone                   0
acarbose                        0
miglitol                        0
troglitazone                    0
tolazamide                      0
examide                         0
citoglipton                     0
insulin                         0
glyburide-metformin             0
glipizide-metformin             0
glimepiride-pioglitazone        0
metformin-rosiglitazone         0
metformin-pioglitazone          0
change                          0
diabetesMed                     0
readmitted                      0
dtype: int64
  • There is no null values
In [29]:
# check if there are any special characters in the data or not
for i in df.columns:
    if df[i].dtype == object:
        print(i,df[i][df[i] == '?'].count())
race 2273
gender 0
age 0
weight 98569
payer_code 40256
medical_specialty 49949
diag_1 21
diag_2 358
diag_3 1423
max_glu_serum 0
A1Cresult 0
metformin 0
repaglinide 0
nateglinide 0
chlorpropamide 0
glimepiride 0
acetohexamide 0
glipizide 0
glyburide 0
tolbutamide 0
pioglitazone 0
rosiglitazone 0
acarbose 0
miglitol 0
troglitazone 0
tolazamide 0
examide 0
citoglipton 0
insulin 0
glyburide-metformin 0
glipizide-metformin 0
glimepiride-pioglitazone 0
metformin-rosiglitazone 0
metformin-pioglitazone 0
change 0
diabetesMed 0
readmitted 0
In [30]:
for i in df.columns:
    print(i,df[i][df[i]=='?'].count()/len(df)*100) # checking in percentage
encounter_id 0.0
patient_nbr 0.0
race 2.2335554114340743
gender 0.0
age 0.0
weight 96.85847925633315
admission_type_id 0.0
discharge_disposition_id 0.0
admission_source_id 0.0
time_in_hospital 0.0
payer_code 39.5574160328597
medical_specialty 49.08220820313268
num_lab_procedures 0.0
num_procedures 0.0
num_medications 0.0
number_outpatient 0.0
number_emergency 0.0
number_inpatient 0.0
diag_1 0.02063557573256294
diag_2 0.3517874339170253
diag_3 1.398305917497003
number_diagnoses 0.0
max_glu_serum 0.0
A1Cresult 0.0
metformin 0.0
repaglinide 0.0
nateglinide 0.0
chlorpropamide 0.0
glimepiride 0.0
acetohexamide 0.0
glipizide 0.0
glyburide 0.0
tolbutamide 0.0
pioglitazone 0.0
rosiglitazone 0.0
acarbose 0.0
miglitol 0.0
troglitazone 0.0
tolazamide 0.0
examide 0.0
citoglipton 0.0
insulin 0.0
glyburide-metformin 0.0
glipizide-metformin 0.0
glimepiride-pioglitazone 0.0
metformin-rosiglitazone 0.0
metformin-pioglitazone 0.0
change 0.0
diabetesMed 0.0
readmitted 0.0
In [31]:
# Displaying the statistical overview of the data
df.describe()
Out[31]:
encounter_id patient_nbr admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient number_diagnoses
count 1.017660e+05 1.017660e+05 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000 101766.000000
mean 1.652016e+08 5.433040e+07 2.024006 3.715642 5.754437 4.395987 43.095641 1.339730 16.021844 0.369357 0.197836 0.635566 7.422607
std 1.026403e+08 3.869636e+07 1.445403 5.280166 4.064081 2.985108 19.674362 1.705807 8.127566 1.267265 0.930472 1.262863 1.933600
min 1.252200e+04 1.350000e+02 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 1.000000
25% 8.496119e+07 2.341322e+07 1.000000 1.000000 1.000000 2.000000 31.000000 0.000000 10.000000 0.000000 0.000000 0.000000 6.000000
50% 1.523890e+08 4.550514e+07 1.000000 1.000000 7.000000 4.000000 44.000000 1.000000 15.000000 0.000000 0.000000 0.000000 8.000000
75% 2.302709e+08 8.754595e+07 3.000000 4.000000 7.000000 6.000000 57.000000 2.000000 20.000000 0.000000 0.000000 1.000000 9.000000
max 4.438672e+08 1.895026e+08 8.000000 28.000000 25.000000 14.000000 132.000000 6.000000 81.000000 42.000000 76.000000 21.000000 16.000000
In [32]:
df.duplicated().sum()
Out[32]:
0

There are total 50 features looking into all of these:¶¶

In [34]:
#1.encounter_id
df['encounter_id'].nunique()
Out[34]:
101766
In [35]:
has_negative = (df['encounter_id'] < 0).any()
print(has_negative)  
False
In [36]:
df['encounter_id'].dtypes
Out[36]:
dtype('int64')
In [37]:
df['encounter_id'].isnull().sum()
Out[37]:
0
  • Patients that have come to the hospital have been provided with different unique encounter_id .so that's why there are 101766 values.

  • Nothing problem with this column in the point of data clening

In [39]:
#2. patient_nbr
df['patient_nbr'].nunique()
Out[39]:
71518
  • There are 71518 unique paitents that have entered the system
  • encounter_id and patient_nbr these are just identifiers and not useful variables so they can be dropped from our future analysis
In [41]:
# 3 race
df['race'].value_counts()
Out[41]:
race
Caucasian          76099
AfricanAmerican    19210
?                   2273
Hispanic            2037
Other               1506
Asian                641
Name: count, dtype: int64
In [42]:
df['race'].isnull().sum()
Out[42]:
0
In [43]:
df['race'].dtypes
Out[43]:
dtype('O')
  • In 'race' column there are 2273 which is 2.23% value filled with '?'
In [45]:
#4.Gender
df['gender'].value_counts()
Out[45]:
gender
Female             54708
Male               47055
Unknown/Invalid        3
Name: count, dtype: int64
In [46]:
df['gender'].dtypes
Out[46]:
dtype('O')
In [47]:
df['gender'].isnull().sum()
Out[47]:
0
  • In this column there are 3 value filled with 'Unknown/Invalid'
In [49]:
#5.age
df['age'].value_counts()
Out[49]:
age
[70-80)     26068
[60-70)     22483
[50-60)     17256
[80-90)     17197
[40-50)      9685
[30-40)      3775
[90-100)     2793
[20-30)      1657
[10-20)       691
[0-10)        161
Name: count, dtype: int64
In [50]:
df['age'].isnull().sum()
Out[50]:
0
In [51]:
df['age'].dtypes
Out[51]:
dtype('O')
  • It has data type showing 'Object' but it should be 'integer'

  • Also ages are in 10 year interval

In [53]:
# 6.weight
df['weight'].value_counts()
Out[53]:
weight
?            98569
[75-100)      1336
[50-75)        897
[100-125)      625
[125-150)      145
[25-50)         97
[0-25)          48
[150-175)       35
[175-200)       11
>200             3
Name: count, dtype: int64
In [54]:
df['weight'].dtypes
Out[54]:
dtype('O')
  • This column contance 98569 '?' value which is 96.85%
  • It contace 25kg weight interval staring from 0 to 200.
  • Data type 'Object'
In [56]:
# 7.admission_type_id
df['admission_type_id'].value_counts()
Out[56]:
admission_type_id
1    53990
3    18869
2    18480
6     5291
5     4785
8      320
7       21
4       10
Name: count, dtype: int64
In [57]:
df['admission_type_id'].isnull().sum()
Out[57]:
0
In [58]:
df['admission_type_id'].dtypes
Out[58]:
dtype('int64')
  • This columns dont have any problem in the point of datacleaning
In [60]:
# 8.discharge_disposition_id
df['discharge_disposition_id'].value_counts()
Out[60]:
discharge_disposition_id
1     60234
3     13954
6     12902
18     3691
2      2128
22     1993
11     1642
5      1184
25      989
4       815
7       623
23      412
13      399
14      372
28      139
8       108
15       63
24       48
9        21
17       14
16       11
19        8
10        6
27        5
12        3
20        2
Name: count, dtype: int64
In [61]:
df['discharge_disposition_id'].dtypes
Out[61]:
dtype('int64')
In [62]:
df['discharge_disposition_id'].isnull().sum()
Out[62]:
0
  • This columns dont have any problem in the point of datacleaning
In [64]:
# 9.admission_source_id
df['admission_source_id'].value_counts()
Out[64]:
admission_source_id
7     57494
1     29565
17     6781
4      3187
6      2264
2      1104
5       855
3       187
20      161
9       125
8        16
22       12
10        8
14        2
11        2
25        2
13        1
Name: count, dtype: int64
In [65]:
df['admission_source_id'].dtypes
Out[65]:
dtype('int64')
In [66]:
df['admission_source_id'].isnull().sum()
Out[66]:
0
  • This columns dont have any problem in the point of datacleaning
In [68]:
# 10.time_in_hospital
df['time_in_hospital'].value_counts()
Out[68]:
time_in_hospital
3     17756
2     17224
1     14208
4     13924
5      9966
6      7539
7      5859
8      4391
9      3002
10     2342
11     1855
12     1448
13     1210
14     1042
Name: count, dtype: int64
In [69]:
df['time_in_hospital'].isnull().sum()
Out[69]:
0
In [70]:
df['time_in_hospital'].dtypes
Out[70]:
dtype('int64')
  • This columns dont have any problem in the point of datacleaning
In [72]:
# 11. payer_code
df['payer_code'].isnull().sum()
Out[72]:
0
In [73]:
df['payer_code'].dtypes
Out[73]:
dtype('O')
In [74]:
df['payer_code'].value_counts()
Out[74]:
payer_code
?     40256
MC    32439
HM     6274
SP     5007
BC     4655
MD     3532
CP     2533
UN     2448
CM     1937
OG     1033
PO      592
DM      549
CH      146
WC      135
OT       95
MP       79
SI       55
FR        1
Name: count, dtype: int64
  • It contance 40256 (39.55%) value as '?'
In [76]:
# 12. medical_specialty
df['medical_specialty'].value_counts()
Out[76]:
medical_specialty
?                                49949
InternalMedicine                 14635
Emergency/Trauma                  7565
Family/GeneralPractice            7440
Cardiology                        5352
                                 ...  
SportsMedicine                       1
Speech                               1
Perinatology                         1
Neurophysiology                      1
Pediatrics-InfectiousDiseases        1
Name: count, Length: 73, dtype: int64
In [77]:
df['medical_specialty'].isnull().sum()
Out[77]:
0
In [78]:
df['medical_specialty'].dtypes
Out[78]:
dtype('O')
  • It contance 49949 (49.08%) value as '?'
In [80]:
# 13. num_lab_procedures
df['num_lab_procedures'].value_counts()
Out[80]:
num_lab_procedures
1      3208
43     2804
44     2496
45     2376
38     2213
       ... 
120       1
132       1
121       1
126       1
118       1
Name: count, Length: 118, dtype: int64
  • Not any problem in terms of Data Cleaning
In [82]:
# 14. num_procedures
df['num_procedures'].value_counts()
Out[82]:
num_procedures
0    46652
1    20742
2    12717
3     9443
6     4954
4     4180
5     3078
Name: count, dtype: int64
  • Not any problem in terms of Data Cleaning
In [84]:
# 15. num_medications
df['num_medications'].value_counts()
Out[84]:
num_medications
13    6086
12    6004
11    5795
15    5792
14    5707
      ... 
70       2
75       2
81       1
79       1
74       1
Name: count, Length: 75, dtype: int64
  • Not any problem in terms of Data Cleaning
In [86]:
# 16. number_outpatient
df['number_outpatient'].isnull().sum()
Out[86]:
0
In [87]:
df['number_outpatient'].dtypes
Out[87]:
dtype('int64')
In [88]:
df['number_outpatient'].value_counts()
Out[88]:
number_outpatient
0     85027
1      8547
2      3594
3      2042
4      1099
5       533
6       303
7       155
8        98
9        83
10       57
11       42
13       31
12       30
14       28
15       20
16       15
17        8
21        7
20        7
18        5
22        5
19        3
27        3
24        3
26        2
23        2
25        2
33        2
35        2
36        2
29        2
34        1
39        1
42        1
28        1
37        1
38        1
40        1
Name: count, dtype: int64
  • 85027 values filled with 0
In [90]:
# 17. number_emergency
df['number_emergency'].value_counts()
Out[90]:
number_emergency
0     90383
1      7677
2      2042
3       725
4       374
5       192
6        94
7        73
8        50
10       34
9        33
11       23
13       12
12       10
22        6
16        5
18        5
19        4
20        4
15        3
14        3
25        2
21        2
28        1
42        1
46        1
76        1
37        1
64        1
63        1
54        1
24        1
29        1
Name: count, dtype: int64
In [91]:
df['number_emergency'].isnull().sum()
Out[91]:
0
In [92]:
df['number_emergency'].dtypes
Out[92]:
dtype('int64')
  • 90383 values filled with 0
In [94]:
#18. number_inpatient
df['number_inpatient'].value_counts()
Out[94]:
number_inpatient
0     67630
1     19521
2      7566
3      3411
4      1622
5       812
6       480
7       268
8       151
9       111
10       61
11       49
12       34
13       20
14       10
15        9
16        6
19        2
17        1
21        1
18        1
Name: count, dtype: int64
In [95]:
df['number_inpatient'].dtypes
Out[95]:
dtype('int64')
In [96]:
df['number_inpatient'].isnull().sum()
Out[96]:
0
In [97]:
#19. diag_1
df['diag_1'].value_counts()
Out[97]:
diag_1
428    6862
414    6581
786    4016
410    3614
486    3508
       ... 
373       1
314       1
684       1
217       1
V51       1
Name: count, Length: 717, dtype: int64
In [98]:
# Count the values in 'diag_1' that start with any letter of the English alphabet
count_alpha_start = df['diag_1'].str.match('^[A-Za-z]').sum()

print(f"Number of values starting with a letter: {count_alpha_start}")
Number of values starting with a letter: 1645
In [99]:
count_values_starting_with_V = df['diag_1'].str.startswith('V').sum()

print("Number of values starting with 'V':", count_values_starting_with_V)
Number of values starting with 'V': 1644
In [100]:
count_values_starting_with_E = df['diag_1'].str.startswith('E').sum()

print("Number of values starting with 'E':", count_values_starting_with_E)
Number of values starting with 'E': 1
In [101]:
df['diag_1'].isnull().sum()
Out[101]:
0
In [102]:
df['diag_1'].dtypes
Out[102]:
dtype('O')
  • Data types is object
  • 21 value filled with '?'
  • There are 1644 values start with 'V'
  • There are 1 values start with 'E'
  • Some decimal values is there
In [104]:
#20. diag_2
df['diag_2'].value_counts()
Out[104]:
diag_2
276     6752
428     6662
250     6071
427     5036
401     3736
        ... 
E918       1
46         1
V13        1
E850       1
927        1
Name: count, Length: 749, dtype: int64
In [105]:
# Count the values in 'diag_2' that start with any letter of the English alphabet
count_alpha_start = df['diag_2'].str.match('^[A-Za-z]').sum()

print(f"Number of values starting with a letter: {count_alpha_start}")
Number of values starting with a letter: 2536
In [106]:
count_values_starting_with_V = df['diag_2'].str.startswith('V').sum()

print("Number of values starting with 'V':", count_values_starting_with_V)
Number of values starting with 'V': 1805
In [107]:
count_values_starting_with_E = df['diag_2'].str.startswith('E').sum()

print("Number of values starting with 'E':", count_values_starting_with_E)
Number of values starting with 'E': 731
In [108]:
df['diag_2'].isnull().sum()
Out[108]:
0
In [109]:
df['diag_2'].dtypes
Out[109]:
dtype('O')
  • Data types is object
  • 358 value filled with '?'
  • There are 1805 values start with 'V'
  • There are 731 values start with 'E'
  • Some decimal values is there
In [111]:
# 21. Diag_3
df['diag_3'].value_counts()
Out[111]:
diag_3
250     11555
401      8289
276      5175
428      4577
427      3955
        ...  
657         1
684         1
603         1
E826        1
971         1
Name: count, Length: 790, dtype: int64
In [112]:
df['diag_3'].dtypes
Out[112]:
dtype('O')
In [113]:
df['diag_3'].isnull().sum()
Out[113]:
0
In [114]:
# Count the values in 'diag_3' that start with any letter of the English alphabet
count_alpha_start = df['diag_3'].str.match('^[A-Za-z]').sum()

print(f"Number of values starting with a letter: {count_alpha_start}")
Number of values starting with a letter: 5058
In [115]:
count_values_starting_with_V = df['diag_3'].str.startswith('V').sum()

print("Number of values starting with 'V':", count_values_starting_with_V)
Number of values starting with 'V': 3814
In [116]:
count_values_starting_with_E = df['diag_3'].str.startswith('E').sum()

print("Number of values starting with 'E':", count_values_starting_with_E)
Number of values starting with 'E': 1244
  • Data types is object
  • 1423 value filled with '?'
  • There are 3814 values start with 'V'
  • There are 1244 values start with 'E'
  • Some decimal values is there
In [118]:
# 22.number_diagnoses
df['number_diagnoses'].value_counts()
Out[118]:
number_diagnoses
9     49474
5     11393
8     10616
7     10393
6     10161
4      5537
3      2835
2      1023
1       219
16       45
10       17
13       16
11       11
15       10
12        9
14        7
Name: count, dtype: int64
In [119]:
df['number_diagnoses'].dtypes
Out[119]:
dtype('int64')
In [120]:
df['number_diagnoses'].isnull().sum()
Out[120]:
0
  • There is no change required in terms of data cleaning
In [122]:
# 23. max_glu_serum
df['max_glu_serum'].value_counts()
Out[122]:
max_glu_serum
Norm    2597
>200    1485
>300    1264
Name: count, dtype: int64
In [123]:
df['max_glu_serum'].isnull().sum()
Out[123]:
96420
In [124]:
df['max_glu_serum'].dtypes
Out[124]:
dtype('O')
  • There is no change required in terms of data cleaning
In [126]:
# 24. A1Cresult
df['A1Cresult'].value_counts()
Out[126]:
A1Cresult
>8      8216
Norm    4990
>7      3812
Name: count, dtype: int64
In [127]:
df['A1Cresult'].isnull().sum()
Out[127]:
84748
In [128]:
df['A1Cresult'].dtypes
Out[128]:
dtype('O')
  • There is no change required in terms of data cleaning
In [130]:
# 25. metformin
df['metformin'].value_counts()
Out[130]:
metformin
No        81778
Steady    18346
Up         1067
Down        575
Name: count, dtype: int64
In [131]:
df['metformin'].dtypes
Out[131]:
dtype('O')
In [132]:
# 26.repaglinide
df['repaglinide'].value_counts()
Out[132]:
repaglinide
No        100227
Steady      1384
Up           110
Down          45
Name: count, dtype: int64
  • No Change required in terms of data cleaning
In [134]:
# 27.nateglinide
df['nateglinide'].value_counts()
Out[134]:
nateglinide
No        101063
Steady       668
Up            24
Down          11
Name: count, dtype: int64
  • No Change required in terms of data cleaning
In [136]:
# 28. chlorpropamide
df['chlorpropamide'].value_counts()
Out[136]:
chlorpropamide
No        101680
Steady        79
Up             6
Down           1
Name: count, dtype: int64
  • No Change required in terms of data cleaning
In [138]:
# 29. glimepiride
df['glimepiride'].value_counts()
Out[138]:
glimepiride
No        96575
Steady     4670
Up          327
Down        194
Name: count, dtype: int64
  • No Change required in terms of data cleaning
In [140]:
# 30. acetohexamide
df['acetohexamide'].value_counts()
Out[140]:
acetohexamide
No        101765
Steady         1
Name: count, dtype: int64
  • In this case acetohexamide is given to only one patient.
In [142]:
# 31. glipizide
df['glipizide'].value_counts()
Out[142]:
glipizide
No        89080
Steady    11356
Up          770
Down        560
Name: count, dtype: int64
  • no change tequired in terms of data cleaning
In [144]:
# 32. glyburide
df['glyburide'].value_counts()
Out[144]:
glyburide
No        91116
Steady     9274
Up          812
Down        564
Name: count, dtype: int64
  • No change required in terms of data cleaning
In [146]:
# 33. tolbutamide
df['tolbutamide'].value_counts()
Out[146]:
tolbutamide
No        101743
Steady        23
Name: count, dtype: int64
  • Only 23 patient this medicine was given
In [148]:
# 34. pioglitazone
df['pioglitazone'].value_counts()
Out[148]:
pioglitazone
No        94438
Steady     6976
Up          234
Down        118
Name: count, dtype: int64
  • No change required in terms of data clening
In [150]:
# 35. rosiglitazone
df['rosiglitazone'].value_counts()
Out[150]:
rosiglitazone
No        95401
Steady     6100
Up          178
Down         87
Name: count, dtype: int64
  • No change required in terms of data cleaning
In [152]:
# 36. acarbose
df['acarbose'].value_counts()
Out[152]:
acarbose
No        101458
Steady       295
Up            10
Down           3
Name: count, dtype: int64
  • only 308 patients was taken this medicine
In [154]:
# 37. miglitol
df['miglitol'].value_counts()
Out[154]:
miglitol
No        101728
Steady        31
Down           5
Up             2
Name: count, dtype: int64
  • only 38 patients was taken this medicine
In [156]:
# 38. troglitazone
df['troglitazone'].value_counts()
Out[156]:
troglitazone
No        101763
Steady         3
Name: count, dtype: int64
  • only 3 patients was taken this medicine
In [158]:
# 39.tolazamide
df['tolazamide'].value_counts()
Out[158]:
tolazamide
No        101727
Steady        38
Up             1
Name: count, dtype: int64
  • only 39 patients was taken this medicine
In [160]:
# 40. examide
df['examide'].value_counts()
Out[160]:
examide
No    101766
Name: count, dtype: int64
  • no one taken this medicine
In [162]:
#41. citoglipton
df['citoglipton'].value_counts()
Out[162]:
citoglipton
No    101766
Name: count, dtype: int64
  • no one taken this medicine
In [164]:
# 42. insulin
df['insulin'].value_counts()
Out[164]:
insulin
No        47383
Steady    30849
Down      12218
Up        11316
Name: count, dtype: int64
  • Nothing to change for this column as per data clening
In [166]:
# 43. glyburide-metformin
df['glyburide-metformin'].value_counts()
Out[166]:
glyburide-metformin
No        101060
Steady       692
Up             8
Down           6
Name: count, dtype: int64
  • only 706 patient was taken this medicine
In [168]:
# 44. glipizide-metformin
df['glipizide-metformin'].value_counts()
Out[168]:
glipizide-metformin
No        101753
Steady        13
Name: count, dtype: int64
  • only 13 patient was taken this medicine
In [170]:
# 45. glimepiride-pioglitazone
df['glimepiride-pioglitazone'].value_counts()
Out[170]:
glimepiride-pioglitazone
No        101765
Steady         1
Name: count, dtype: int64
  • only one patient was taken this medicine
In [172]:
# 46. metformin-rosiglitazone
df['metformin-rosiglitazone'].value_counts()
Out[172]:
metformin-rosiglitazone
No        101764
Steady         2
Name: count, dtype: int64
  • only 2 patient was taken this medicine
In [174]:
# 47. metformin-pioglitazone
df['metformin-pioglitazone'].value_counts()
Out[174]:
metformin-pioglitazone
No        101765
Steady         1
Name: count, dtype: int64
  • only one patient was taken this medicine
In [176]:
# 48. change
df['change'].value_counts()
Out[176]:
change
No    54755
Ch    47011
Name: count, dtype: int64
  • Nothing to change for this column in terms of data clening
In [178]:
# 49.diabetesMed
df['diabetesMed'].value_counts()
Out[178]:
diabetesMed
Yes    78363
No     23403
Name: count, dtype: int64
  • no need to change anything for this columns
In [180]:
# 50. readmitted
df['readmitted'].value_counts()
Out[180]:
readmitted
NO     54864
>30    35545
<30    11357
Name: count, dtype: int64
  • No need to change anything for this columns in the point of data cleaning

Issues with the Dataset¶

  1. 'encounter_id' and 'patient_nbr' both are patient identifier

  2. In 'race' column there are 2273 which is 2.23% value filled with '?'

  3. In 'gender' column there are 3 value filled with ''Unknown/Invalid'

  4. In 'age' column the dataset only gives us age as 10 year interval(from 0 to 100). so we don't know the exact age of each patient and data type also object.

    • This column contance 98569 '?' value which is 96.85%
  • It contace 25kg weight interval staring from 0 to 200.
  • Data type 'Object'
  1. In 'payer_code' column .It contance 40256 (39.55%) value as '?'

  2. In 'medical_specialty' column 49949 (49.08%) value as '?'

  3. In 'number_outpatient' column 85027 values filled with 0

  4. In 'number_emergency' column 90383 values filled with 0

  5. In 'diag_1' column

  • Data types is object
  • 21 value filled with '?'
  • There are 1644 values start with 'V'
  • There are 1 values start with 'E'
  • Some decimal values is there
  1. In 'diag_2' column
  • Data types is object
  • 358 value filled with '?'
  • There are 1805 values start with 'V'
  • There are 731 values start with 'E'
  • Some decimal values is there
  1. In 'diag_3' column
  • Data types is object
  • 1423 value filled with '?'
  • There are 3814 values start with 'V'
  • There are 1244 values start with 'E'
  • Some decimal values is there
  1. 'acetohexamide' is give to only one patient

  2. 'tolbutamide' Only 23 patient this medicine was given

  3. 'acarbose' only 308 patient wa taken this medicine

  4. 'miglitol' only 38 patient was given to this medicine

  5. 'troglitazone' only 3 patients was taken this medicine

  6. 'tolazamide' only 39 patient was taken to this medicine

  7. 'examide' no one taken this medicine

  8. 'citoglipton' no one taken this medicine

  9. 'glyburide-metformin' only 706 patient was taken this medicine

  10. 'glipizide-metformin' only 13 patient was taken this medicine

  11. 'glimepiride-pioglitazone' only one patient was taken this medicine

  12. 'metformin-rosiglitazone' only 2 patient was taken this medicine

  13. 'metformin-pioglitazone' only one patient was taken this medicine

(C) - Data Cleaning¶

In [185]:
# Make a copy of the dataset
patients_df = df.copy
  • In this dataset missing values are represented as "?" sign
  • weight has 96.85%,payer_code has 39.55% and medical_specialty has 49.08% missing values that means we can drop this columns
In [187]:
# dropping columns with large number of missing values
df = df.drop(['weight','payer_code','medical_specialty'], axis=1)
In [188]:
df.shape
Out[188]:
(101766, 47)

There are total 50 features looking into all of these:¶¶

In [190]:
#1.encounter_id
df['encounter_id'].nunique()
Out[190]:
101766
  • Patients that have come to the hospital have been provided with different unique encounter_id .so that's why there are 101766 values.
In [192]:
#2. patient_nbr
df['patient_nbr'].nunique()
Out[192]:
71518
  • There are 71518 unique paitents that have entered the system
  • encounter_id and patient_nbr these are just identifiers and not useful variables so they can be dropped from our future analysis
In [194]:
#droping not useful variables
df.drop(['encounter_id','patient_nbr'],axis=1,inplace=True)
In [195]:
# 3 race
df['race'].value_counts()
Out[195]:
race
Caucasian          76099
AfricanAmerican    19210
?                   2273
Hispanic            2037
Other               1506
Asian                641
Name: count, dtype: int64
  • There are unknown values in the form of '?' which we can replaced with nan values.so that later these values can be filled
In [197]:
df['race'] = df['race'].replace('?',np.nan)
In [198]:
df['race'].unique()
Out[198]:
array(['Caucasian', 'AfricanAmerican', nan, 'Other', 'Asian', 'Hispanic'],
      dtype=object)
In [199]:
df['race']=df['race'].fillna(df['race'].mode()[0]) #replacing with mode
In [200]:
df['race'].value_counts()
Out[200]:
race
Caucasian          78372
AfricanAmerican    19210
Hispanic            2037
Other               1506
Asian                641
Name: count, dtype: int64
In [201]:
#4.Gender
df['gender'].value_counts()
Out[201]:
gender
Female             54708
Male               47055
Unknown/Invalid        3
Name: count, dtype: int64
  • There is a 3rd category of unknown/invalid which are missing/unknown values as there count is less so they can be dropped
In [203]:
df['gender']=df['gender'].replace('Unknown/Invalid',np.nan)
df['gender'].dropna(inplace=True)
In [204]:
df['gender'].value_counts()
Out[204]:
gender
Female    54708
Male      47055
Name: count, dtype: int64
In [205]:
df['gender'].unique()
Out[205]:
array(['Female', 'Male', nan], dtype=object)
In [206]:
df['gender'] = df['gender'].fillna(df['gender'].mode()[0])
In [207]:
df['gender'].value_counts()
Out[207]:
gender
Female    54711
Male      47055
Name: count, dtype: int64
In [208]:
df.replace('?',np.nan,inplace=True)
In [209]:
df.isnull().sum()
Out[209]:
race                            0
gender                          0
age                             0
admission_type_id               0
discharge_disposition_id        0
admission_source_id             0
time_in_hospital                0
num_lab_procedures              0
num_procedures                  0
num_medications                 0
number_outpatient               0
number_emergency                0
number_inpatient                0
diag_1                         21
diag_2                        358
diag_3                       1423
number_diagnoses                0
max_glu_serum               96420
A1Cresult                   84748
metformin                       0
repaglinide                     0
nateglinide                     0
chlorpropamide                  0
glimepiride                     0
acetohexamide                   0
glipizide                       0
glyburide                       0
tolbutamide                     0
pioglitazone                    0
rosiglitazone                   0
acarbose                        0
miglitol                        0
troglitazone                    0
tolazamide                      0
examide                         0
citoglipton                     0
insulin                         0
glyburide-metformin             0
glipizide-metformin             0
glimepiride-pioglitazone        0
metformin-rosiglitazone         0
metformin-pioglitazone          0
change                          0
diabetesMed                     0
readmitted                      0
dtype: int64
In [210]:
#5.age
df['age'].value_counts()
Out[210]:
age
[70-80)     26068
[60-70)     22483
[50-60)     17256
[80-90)     17197
[40-50)      9685
[30-40)      3775
[90-100)     2793
[20-30)      1657
[10-20)       691
[0-10)        161
Name: count, dtype: int64
  • It shows the age range between 0 to 100
  • Most of the population come under 50 to 80
  • Dealing with age.The dataset only gives us age as 10 year interval. so we don't know the exact age of each patient
  • We assume age of the patient on average lies at the midpoint of the age category.For eg: If the patients age category is 40-50 years, Then we assume the age = 45 years.
  • So we converted age categories to midpoints
In [212]:
df['age'].unique()
Out[212]:
array(['[0-10)', '[10-20)', '[20-30)', '[30-40)', '[40-50)', '[50-60)',
       '[60-70)', '[70-80)', '[80-90)', '[90-100)'], dtype=object)
In [213]:
df['age']=df['age'].replace({'[0-10)':5,'[10-20)':15,'[20-30)':25,'[30-40)':35,'[40-50)':45,'[50-60)':55,
 '[60-70)':65,'[70-80)':75,'[80-90)':85,'[90-100)':95})     
In [214]:
df.age.value_counts()
Out[214]:
age
75    26068
65    22483
55    17256
85    17197
45     9685
35     3775
95     2793
25     1657
15      691
5       161
Name: count, dtype: int64
In [215]:
#6.admission_type_id
df['admission_type_id'].value_counts()
Out[215]:
admission_type_id
1    53990
3    18869
2    18480
6     5291
5     4785
8      320
7       21
4       10
Name: count, dtype: int64
In [216]:
#7.max_glu_serum 
df.max_glu_serum.value_counts()
Out[216]:
max_glu_serum
Norm    2597
>200    1485
>300    1264
Name: count, dtype: int64
In [217]:
df.max_glu_serum.unique()
Out[217]:
array([nan, '>300', 'Norm', '>200'], dtype=object)
In [218]:
df['max_glu_serum'].fillna('Test Not Done', inplace=True)
In [219]:
df.max_glu_serum.value_counts()
Out[219]:
max_glu_serum
Test Not Done    96420
Norm              2597
>200              1485
>300              1264
Name: count, dtype: int64
In [220]:
# A1Cresult  
df.A1Cresult.value_counts()
Out[220]:
A1Cresult
>8      8216
Norm    4990
>7      3812
Name: count, dtype: int64
In [221]:
df.A1Cresult.unique()
Out[221]:
array([nan, '>7', '>8', 'Norm'], dtype=object)
In [222]:
df['A1Cresult'].fillna('Test Not Done', inplace=True)
In [223]:
df.A1Cresult.value_counts()
Out[223]:
A1Cresult
Test Not Done    84748
>8                8216
Norm              4990
>7                3812
Name: count, dtype: int64
In [224]:
df.shape
Out[224]:
(101766, 45)

EDA¶

Handling outliers¶

In [227]:
# Create a boxplot for each numerical column
plt.figure(figsize=(10, 8))
sns.boxplot(data=df)
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()
No description has been provided for this image

Preserving outliers in these columns is vital as they exhibit a strong correlation with the target variable in healthcare analytics. Outliers in healthcare data often represent extreme cases or rare conditions, which hold significant clinical relevance and predictive power. Removing them risks distorting the true nature of patient outcomes, treatment effectiveness, or disease progression, compromising the accuracy and interpretability of predictive models. By retaining these outliers, we ensure the model captures the full spectrum of patient experiences, enabling more nuanced insights, precise risk assessment, and tailored interventions, ultimately - enhancing the quality of care delivery and patient outcomes in clinical practice.

Feature engineering¶

Importing ID's mapping CSV¶

In [231]:
ID_map=pd.read_csv('IDs_mapping.csv')
In [232]:
ID_map.loc[:,['admission_type_id','description']].head(8)
Out[232]:
admission_type_id description
0 1.0 Emergency
1 2.0 Urgent
2 3.0 Elective
3 4.0 Newborn
4 5.0 Not Available
5 6.0 NaN
6 7.0 Trauma Center
7 8.0 Not Mapped

According to ID's mapping¶

admission_type_id description:

  • These 8 values correspond to:

1.Emergency 2.Urgent 3.Elective 4.Newborn 5.Not Avallable 6.NULL 7.Trauma Center 8.Not Mapped

  • From this description we can see that the majority patients that heve been admitted in this hospital have an admission type as emergency.

  • For future analysis we can merged these categories into less number of categories.

  • Emergency/Urgent/Trauma Center-To-Emergency

  • Not Available/Null/Not Mapped-TO-Not Avaliable

  • Elective & New Born remain same

In [235]:
df['admission_type_id']=df['admission_type_id'].replace({1:'Emergency',2:'Emergency',7:'Emergency',5:'Not Available',6:'Not Available',8:'Not Available',3:'Elective',4:'Newborn'})
In [236]:
df['admission_type_id'].value_counts()
Out[236]:
admission_type_id
Emergency        72491
Elective         18869
Not Available    10396
Newborn             10
Name: count, dtype: int64
  • From admission_type_id dropping Newborn category as its not giving correct information if we compare it with the age
In [238]:
df[df.loc[:,'admission_type_id'] == 'Newborn']
Out[238]:
race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications ... citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
2043 Caucasian Male 55 Newborn 1 4 2 56 2 13 ... No Steady No No No No No No Yes NO
2203 Caucasian Male 85 Newborn 1 4 2 46 6 14 ... No Steady No No No No No No Yes >30
2461 AfricanAmerican Male 75 Newborn 6 4 5 73 1 12 ... No Up No No No No No Ch Yes NO
4823 Caucasian Female 65 Newborn 6 4 2 39 0 7 ... No Steady No No No No No No Yes <30
35877 AfricanAmerican Male 65 Newborn 1 14 3 66 0 18 ... No Steady No No No No No No Yes NO
47548 Hispanic Male 45 Newborn 1 7 6 39 4 7 ... No No No No No No No No No NO
48711 Caucasian Female 75 Newborn 6 1 4 37 1 18 ... No No No No No No No Ch Yes >30
80354 Hispanic Female 5 Newborn 1 1 2 27 0 2 ... No No No No No No No No No NO
87714 Caucasian Male 65 Newborn 1 4 3 28 0 15 ... No Steady Steady No No No No Ch Yes NO
100721 Asian Male 85 Newborn 6 7 3 51 0 10 ... No No No No No No No No Yes NO

10 rows × 45 columns

In [239]:
df.drop(df.index[df['admission_type_id'] == 'Newborn'],inplace = True)
In [240]:
df['admission_type_id'].value_counts()
Out[240]:
admission_type_id
Emergency        72491
Elective         18869
Not Available    10396
Name: count, dtype: int64
In [241]:
#7.discharge_disposition_id:
df['discharge_disposition_id'].value_counts()
Out[241]:
discharge_disposition_id
1     60228
3     13954
6     12898
18     3691
2      2128
22     1993
11     1642
5      1184
25      989
4       815
7       623
23      412
13      399
14      372
28      139
8       108
15       63
24       48
9        21
17       14
16       11
19        8
10        6
27        5
12        3
20        2
Name: count, dtype: int64
In [242]:
df['discharge_disposition_id'].unique()
Out[242]:
array([25,  1,  3,  6,  2,  5, 11,  7, 10,  4, 14, 18,  8, 13, 12, 16, 17,
       22, 23,  9, 20, 15, 24, 28, 19, 27], dtype=int64)

According to ID's mapping:¶

  • discharge_disposition_id description:
In [245]:
ID_map.loc[:,['discharge_disposition_id','description1']].head(30)
Out[245]:
discharge_disposition_id description1
0 1 Discharged to home
1 2 Discharged/transferred to another short term h...
2 3 Discharged/transferred to SNF
3 4 Discharged/transferred to ICF
4 5 Discharged/transferred to another type of inpa...
5 6 Discharged/transferred to home with home healt...
6 7 Left AMA
7 8 Discharged/transferred to home under care of H...
8 9 Admitted as an inpatient to this hospital
9 10 Neonate discharged to another hospital for neo...
10 11 Expired
11 12 Still patient or expected to return for outpat...
12 13 Hospice / home
13 14 Hospice / medical facility
14 15 Discharged/transferred within this institution...
15 16 Discharged/transferred/referred another instit...
16 17 Discharged/transferred/referred to this instit...
17 18 NaN
18 19 Expired at home. Medicaid only, hospice.
19 20 Expired in a medical facility. Medicaid only, ...
20 21 Expired, place unknown. Medicaid only, hospice.
21 22 Discharged/transferred to another rehab fac in...
22 23 Discharged/transferred to a long term care hos...
23 24 Discharged/transferred to a nursing facility c...
24 25 Not Mapped
25 26 Unknown/Invalid
26 30 Discharged/transferred to another Type of Heal...
27 27 Discharged/transferred to a federal health car...
28 28 Discharged/transferred/referred to a psychiatr...
29 29 Discharged/transferred to a Critical Access Ho...

We are merging several categories of discharge_diposition_id into fewer categories:¶

  • discharge_disposition_id: 1,6,8-TO-Discharge to home
  • discharge_disposition_id: 2,3,4,5,16,22,23,24,27,28,29,30-TO-Transferred to another facility
  • discharge_disposition_id: 7-TO-Left AMA (Against Medical Advice)
  • discharge_disposition_id: 9,12,15,17-TO-Still patient/referred to this institution
  • discharge_disposition_id: 10 -TO-Neonate discharged
  • discharge_disposition_id: 11,19,20,21 -TO-Expired
  • discharge_disposition_id: 13,14 -TO-Hospice
  • discharge_disposition_id: 18,25,26 -TO-Not Available
In [248]:
df['discharge_disposition_id']=df['discharge_disposition_id'].replace({1:'Discharged to home',
6:'Discharged to home',8:'Discharged to home',2:'Transferred to another facility',
3:'Transferred to another facility',4:'Transferred to another facility',5:'Transferred to another facility',
16:'Transferred to another facility',22:'Transferred to another facility',23:'Transferred to another facility',
24:'Transferred to another facility',27:'Transferred to another facility',28:'Transferred to another facility',
29:'Transferred to another facility',30:'Transferred to another facility',7:'Left AMA',
9:'Still patient/referred to this institution',12:'Still patient/referred to this institution',
15:'Still patient/referred to this institution',17:'Still patient/referred to this institution',
10:'Neonate discharged',11:'Expired',19:'Expired',20:'Expired',21:'Expired',13:'Hospice',14:'Hospice',
18:'Not Available',25:'Not Available',26:'Not Available'})
In [249]:
df['discharge_disposition_id'].value_counts()
Out[249]:
discharge_disposition_id
Discharged to home                            73234
Transferred to another facility               20689
Not Available                                  4680
Expired                                        1652
Hospice                                         771
Left AMA                                        623
Still patient/referred to this institution      101
Neonate discharged                                6
Name: count, dtype: int64
  • The patients who have expired,Hospice and Neonate(newborn child)should be removed from the column because they will create bias as they will not be readmitted
In [251]:
df = df[df['discharge_disposition_id']!='Expired']
df = df[df['discharge_disposition_id']!='Neonate discharged']
df = df[df['discharge_disposition_id']!='Hospice']
In [252]:
df['discharge_disposition_id'].value_counts()
Out[252]:
discharge_disposition_id
Discharged to home                            73234
Transferred to another facility               20689
Not Available                                  4680
Left AMA                                        623
Still patient/referred to this institution      101
Name: count, dtype: int64
In [253]:
#8 admission_source_id:
df['admission_source_id'].value_counts()
Out[253]:
admission_source_id
7     55845
1     29163
17     6570
4      3113
6      2239
2      1081
5       806
3       185
20      159
9       125
8        15
22       12
10        8
11        2
25        2
14        1
13        1
Name: count, dtype: int64
In [254]:
df['admission_source_id'].nunique()
Out[254]:
17

According to ID's mapping:¶

admission_source_id description:

In [257]:
ID_map.loc[:,['admission_source_id','description2']].head(25)
Out[257]:
admission_source_id description2
0 1.0 Physician Referral
1 2.0 Clinic Referral
2 3.0 HMO Referral
3 4.0 Transfer from a hospital
4 5.0 Transfer from a Skilled Nursing Facility (SNF)
5 6.0 Transfer from another health care facility
6 7.0 Emergency Room
7 8.0 Court/Law Enforcement
8 9.0 Not Available
9 10.0 Transfer from critial access hospital
10 11.0 Normal Delivery
11 12.0 Premature Delivery
12 13.0 Sick Baby
13 14.0 Extramural Birth
14 15.0 Not Available
15 17.0 NaN
16 18.0 Transfer From Another Home Health Agency
17 19.0 Readmission to Same Home Health Agency
18 20.0 Not Mapped
19 21.0 Unknown/Invalid
20 22.0 Transfer from hospital inpt/same fac reslt in...
21 23.0 Born inside this hospital
22 24.0 Born outside this hospital
23 25.0 Transfer from Ambulatory Surgery Center
24 26.0 Transfer from Hospice

We are merging several categories of admission_source_id into fewer categories:¶

  • admission_source_id: 1,2,3 -TO-Referral
  • admission_source_id: 4,5,6,10,18,19,22,25,26 -TO-Transferred from hospital(another care facility)
  • admission_source_id: 7,8 -TO-Emergency
  • admission_source_id: 9,15,17,20,21 -TO-Not Available
  • admission_source_id: 11,12,14,13,23,24 -TO-Delivery
In [260]:
df['admission_source_id']=df['admission_source_id'].replace({1:'Referral',2:'Referral',3:'Referral',4:'Transferred from hospital',5:'Transferred from hospital',6:'Transferred from hospital',10:'Transferred from hospital',18:'Transferred from hospital',19:'Transferred from hospital',25:'Transferred from hospital',26:'Transferred from hospital',22:'Transferred from hospital',7:'Emergency',8:'Emergency',9:'Not Available',15:'Not Available',17:'Not Available',20:'Not Available',21:'Not Available',11:'Delivery',12:'Delivery', 13:'Delivery',14:'Delivery',23:'Delivery',24:'Delivery'})
In [261]:
df['admission_source_id'].value_counts()
Out[261]:
admission_source_id
Emergency                    55860
Referral                     30429
Not Available                 6854
Transferred from hospital     6180
Delivery                         4
Name: count, dtype: int64
  • Dropping the category delivery because the admission_source_id == delivery and the age corresponding to it give contradictory information
In [263]:
df=df[df['admission_source_id']!='Delivery']
In [264]:
df['admission_source_id'].value_counts()
Out[264]:
admission_source_id
Emergency                    55860
Referral                     30429
Not Available                 6854
Transferred from hospital     6180
Name: count, dtype: int64
In [265]:
#9.time_in_hospital
df['time_in_hospital'].value_counts(sort=False)
Out[265]:
time_in_hospital
1     13824
3     17425
2     16886
4     13682
5      9747
13     1152
12     1383
9      2878
7      5696
10     2262
6      7354
11     1770
8      4269
14      995
Name: count, dtype: int64
  • The time in the hospital ranges from 1 day to 14 days
In [267]:
#10.num_lab_procedures
df['num_lab_procedures'].unique()
Out[267]:
array([ 41,  59,  11,  44,  51,  31,  70,  73,  68,  33,  47,  62,  60,
        55,  49,  75,  45,  29,  35,  42,  66,  36,  19,  64,  25,  53,
        52,  87,  27,  37,  28,  48,  10,   2,  65,  67,  40,  54,  58,
        57,  43,  32,  83,  34,  39,  69,  38,  72,  22,  96,  46,  78,
        56,  61,  88,  50,   1,  18,  82,   9,  63,  24,  71,  77,  81,
        76,  90,  93,   3, 103,  13,  80,  85,  16,  15,  12,  30,  23,
        17,  21,  79,  26,   5,  95,  97,  84,  14,  74, 105,  86,  98,
        20,   6,  94,   8, 102,   7,  89,  91,  92,   4, 101,  99, 100,
       114, 113, 111, 129, 107, 108, 106, 104, 109, 120, 132, 121, 126,
       118], dtype=int64)
In [268]:
print('The number of lab tests a patient has undergone can range from',df.num_lab_procedures.min(),'to',df.num_lab_procedures.max())
The number of lab tests a patient has undergone can range from 1 to 132
In [269]:
11#num_procedures:
df.num_procedures.value_counts()
Out[269]:
num_procedures
0    45672
1    20247
2    12367
3     9202
6     4799
4     4048
5     2988
Name: count, dtype: int64

No of procedures range from 0 to 5

In [271]:
#12. num_medication:
print('The number of medication range from a minimum of',df.num_medications.min(),'to maximum of',df.num_medications.max())
The number of medication range from a minimum of 1 to maximum of 81
  • The number of medications range from a minimum of 1 to maximum of 81

1.number_outpatient: Number of outpatient visits of the patient in the year preceding the encounter

2.number_emergency: Number of emergency visits of the patient in the year preceding the encounter

3.number_inpatient: Number of inpatient visits of the patient in the year preceding the encounter

In [273]:
df['number_outpatient'].value_counts()
Out[273]:
number_outpatient
0     82976
1      8348
2      3509
3      1996
4      1077
5       514
6       294
7       154
8        96
9        82
10       57
11       41
13       30
12       29
14       28
15       20
16       13
17        8
21        6
20        6
18        5
22        5
19        3
27        3
24        3
26        2
23        2
33        2
35        2
36        2
29        2
39        1
34        1
42        1
25        1
28        1
37        1
38        1
40        1
Name: count, dtype: int64
In [274]:
df['number_emergency'].value_counts()
Out[274]:
number_emergency
0     88229
1      7474
2      1984
3       706
4       369
5       190
6        93
7        72
8        50
10       34
9        33
11       23
13       12
12       10
22        6
16        5
18        5
19        4
20        4
15        3
14        3
25        2
21        2
28        1
42        1
46        1
76        1
37        1
64        1
63        1
54        1
24        1
29        1
Name: count, dtype: int64
In [275]:
df['number_inpatient'].value_counts()
Out[275]:
number_inpatient
0     66230
1     18980
2      7299
3      3271
4      1574
5       791
6       474
7       262
8       145
9       109
10       59
11       49
12       32
13       18
14       10
15        9
16        6
19        2
17        1
21        1
18        1
Name: count, dtype: int64

Service utilization: The data contains variables for number of inpatient(admissions).Emergency room visits and outpatient visits for a given patient in the previous one year.These are (crude) measures of how much hospital/clinic services a person has used in the past year.we added these three to create a new variable called service utilization(sea figure below). The idea was to see which version gives us better results.Granted,we did not apply any special weighting to the three ingredients of service utilization but we wanted to try something simple at this stage. so we combined these features into a new variable called Service utilization

In [277]:
df['service_utilization'] = df['number_outpatient']+df['number_emergency']+df['number_inpatient'] #year_visits
df.drop(['number_outpatient','number_emergency','number_inpatient'],axis=1,inplace=True)
In [278]:
df.columns
Out[278]:
Index(['race', 'gender', 'age', 'admission_type_id',
       'discharge_disposition_id', 'admission_source_id', 'time_in_hospital',
       'num_lab_procedures', 'num_procedures', 'num_medications', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted',
       'service_utilization'],
      dtype='object')
In [279]:
df.head()
Out[279]:
race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications ... insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted service_utilization
0 Caucasian Female 5 Not Available Not Available Referral 1 41 0 1 ... No No No No No No No No NO 0
1 Caucasian Female 15 Emergency Discharged to home Emergency 3 59 0 18 ... Up No No No No No Ch Yes >30 0
2 AfricanAmerican Female 25 Emergency Discharged to home Emergency 2 11 5 13 ... No No No No No No No Yes NO 3
3 Caucasian Male 35 Emergency Discharged to home Emergency 2 44 1 16 ... Up No No No No No Ch Yes NO 0
4 Caucasian Male 45 Emergency Discharged to home Emergency 1 51 0 8 ... Steady No No No No No Ch Yes NO 0

5 rows × 43 columns

15.Diagnosis 1 : The primary diagnosis (coded as first three digit of ICD9):848 distint values

16.Diagnosis 2 : Secondary diagnosis (coded as first three digit of ICD9):923 distint values

17.Diagnosis 3 : Additional secondary diagnosis (coded as first three digit of ICD9):954 distint values

In [281]:
df.isnull().sum()
Out[281]:
race                           0
gender                         0
age                            0
admission_type_id              0
discharge_disposition_id       0
admission_source_id            0
time_in_hospital               0
num_lab_procedures             0
num_procedures                 0
num_medications                0
diag_1                        20
diag_2                       356
diag_3                      1419
number_diagnoses               0
max_glu_serum                  0
A1Cresult                      0
metformin                      0
repaglinide                    0
nateglinide                    0
chlorpropamide                 0
glimepiride                    0
acetohexamide                  0
glipizide                      0
glyburide                      0
tolbutamide                    0
pioglitazone                   0
rosiglitazone                  0
acarbose                       0
miglitol                       0
troglitazone                   0
tolazamide                     0
examide                        0
citoglipton                    0
insulin                        0
glyburide-metformin            0
glipizide-metformin            0
glimepiride-pioglitazone       0
metformin-rosiglitazone        0
metformin-pioglitazone         0
change                         0
diabetesMed                    0
readmitted                     0
service_utilization            0
dtype: int64

Treating missing values in diag_1,diag_2,diag_3 by dropping all the missing values

In [283]:
#diag_1,diag_2,diag_3
df=df.dropna(subset=['diag_1','diag_2','diag_3'])
In [284]:
df.isnull().sum()
Out[284]:
race                        0
gender                      0
age                         0
admission_type_id           0
discharge_disposition_id    0
admission_source_id         0
time_in_hospital            0
num_lab_procedures          0
num_procedures              0
num_medications             0
diag_1                      0
diag_2                      0
diag_3                      0
number_diagnoses            0
max_glu_serum               0
A1Cresult                   0
metformin                   0
repaglinide                 0
nateglinide                 0
chlorpropamide              0
glimepiride                 0
acetohexamide               0
glipizide                   0
glyburide                   0
tolbutamide                 0
pioglitazone                0
rosiglitazone               0
acarbose                    0
miglitol                    0
troglitazone                0
tolazamide                  0
examide                     0
citoglipton                 0
insulin                     0
glyburide-metformin         0
glipizide-metformin         0
glimepiride-pioglitazone    0
metformin-rosiglitazone     0
metformin-pioglitazone      0
change                      0
diabetesMed                 0
readmitted                  0
service_utilization         0
dtype: int64
In [285]:
#15.number_diagnoses:
df['number_diagnoses'].value_counts()
Out[285]:
number_diagnoses
9     47797
5     11165
8     10305
7     10194
6      9967
4      5464
3      2804
16       42
13       16
10       16
11       11
15       10
12        8
14        6
Name: count, dtype: int64
In [286]:
#16.max_glu_serum
df.max_glu_serum.value_counts()
Out[286]:
max_glu_serum
Test Not Done    92741
Norm              2504
>200              1398
>300              1162
Name: count, dtype: int64
In [287]:
df.max_glu_serum.value_counts()
Out[287]:
max_glu_serum
Test Not Done    92741
Norm              2504
>200              1398
>300              1162
Name: count, dtype: int64
  • It indicates the range of the result was not taken values,">200",">300","normal" and "Test Not Done" if not measured
  • We also reducing Glucose serum test into categories of Normal(1),Abnormal(2) and Not tested(0)
In [289]:
df["max_glu_serum"] = df["max_glu_serum"].replace({">200":2,">300":2,"Norm":1,"Test Not Done":0})
In [290]:
df.max_glu_serum.value_counts()
Out[290]:
max_glu_serum
0    92741
2     2560
1     2504
Name: count, dtype: int64

Encoding¶

The outcome we are looking at is whether the patient gets readmitted to the hospital within 30 days or not.

In [293]:
df['readmitted'].value_counts()
Out[293]:
readmitted
NO     51475
>30    35124
<30    11206
Name: count, dtype: int64
In [294]:
#Readmission converting multiclass to binary class
df['readmitted'] = df['readmitted'].replace({">30":0,'NO':0,"<30":1})
In [295]:
df['readmitted'].value_counts()
Out[295]:
readmitted
0    86599
1    11206
Name: count, dtype: int64
In [296]:
# 21.A1Cresult
  • A1c test result indicates the range of the result or if the test was not taken. values: ">8" if the result was greater than 8%,">7" if the result was greater than 7% but less than 8%,'normal' if the result was less than 7%, and 'none' if not measured
  • When sugar enters your bloodstream, it attaches to hemoglobin, a protein in your red blood cells.Everybody has some sugar attached to their hemoglobin, but people with higher blood sugar levels have more. The A1C test measures the percentage of your red blood cells that have sugar-coated hemoglobin.
In [298]:
df['A1Cresult'].unique()
Out[298]:
array(['Test Not Done', '>7', '>8', 'Norm'], dtype=object)
In [299]:
df['A1Cresult'].value_counts()
Out[299]:
A1Cresult
Test Not Done    81393
>8                7797
Norm              4867
>7                3748
Name: count, dtype: int64

A normal A1C levels is below 5.7%,a level of 5.7% to 6.4% indicates prediabetes,and a level of 6.5% or more indicates diabetes.Within the 5.7% to 6.4% prediabetes range,The higher your A1C.the greater your risk is for developing type 2 diabetes.

In [301]:
df['A1Cresult'] = df['A1Cresult'].replace({"Test Not Done":0,"Norm":1,">7":2,">8":2})
In [302]:
df['A1Cresult'].unique()
Out[302]:
array([0, 2, 1], dtype=int64)
In [303]:
#Gender
df['gender'] = df['gender'].replace({'Female':0,'Male':1})

1.Change: Change of medication indicates if there was a change in diabetic medication(either dosage or generic).Values:'change' and 'no change'

In [305]:
df['change'].value_counts()
Out[305]:
change
No    52346
Ch    45459
Name: count, dtype: int64

we encoded the 'medication change' feature from 'No'(no change) and 'Ch'(changed) into 0 and 1.

In [307]:
df['change'] = df['change'].replace({'No':0,'Ch':1})

1.diabetesMed: Indicates if there was any diabetic medication prescribed; Values: 'yes' and 'no'

In [309]:
df['diabetesMed'].value_counts()
Out[309]:
diabetesMed
Yes    75497
No     22308
Name: count, dtype: int64
In [310]:
# encoding diabetes med
df['diabetesMed'] = df['diabetesMed'].replace('Yes',1)
df['diabetesMed'] = df['diabetesMed'].replace('No',0)
In [311]:
df['diabetesMed'].value_counts()
Out[311]:
diabetesMed
1    75497
0    22308
Name: count, dtype: int64
In [312]:
df.columns
Out[312]:
Index(['race', 'gender', 'age', 'admission_type_id',
       'discharge_disposition_id', 'admission_source_id', 'time_in_hospital',
       'num_lab_procedures', 'num_procedures', 'num_medications', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted',
       'service_utilization'],
      dtype='object')
  • 23 features for medications For the generic names: metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin, glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone and metformin-pioglitazone.the features indicates whether the drug was prescribed or there was Values 'up' if the dosage was increased during the encounter,'down' if the dosage was decreased,'steady' if the dosage did not change, and 'no' if the drug was not prescribed
In [314]:
drug = list(df.iloc[:,16:39].columns)
drug
Out[314]:
['metformin',
 'repaglinide',
 'nateglinide',
 'chlorpropamide',
 'glimepiride',
 'acetohexamide',
 'glipizide',
 'glyburide',
 'tolbutamide',
 'pioglitazone',
 'rosiglitazone',
 'acarbose',
 'miglitol',
 'troglitazone',
 'tolazamide',
 'examide',
 'citoglipton',
 'insulin',
 'glyburide-metformin',
 'glipizide-metformin',
 'glimepiride-pioglitazone',
 'metformin-rosiglitazone',
 'metformin-pioglitazone']
In [315]:
for i in drug:
    print(i,df[i].unique())
metformin ['No' 'Steady' 'Up' 'Down']
repaglinide ['No' 'Up' 'Steady' 'Down']
nateglinide ['No' 'Steady' 'Down' 'Up']
chlorpropamide ['No' 'Steady' 'Down' 'Up']
glimepiride ['No' 'Steady' 'Down' 'Up']
acetohexamide ['No' 'Steady']
glipizide ['No' 'Steady' 'Up' 'Down']
glyburide ['No' 'Steady' 'Up' 'Down']
tolbutamide ['No' 'Steady']
pioglitazone ['No' 'Steady' 'Up' 'Down']
rosiglitazone ['No' 'Steady' 'Up' 'Down']
acarbose ['No' 'Steady' 'Up' 'Down']
miglitol ['No' 'Steady' 'Down' 'Up']
troglitazone ['No' 'Steady']
tolazamide ['No' 'Steady' 'Up']
examide ['No']
citoglipton ['No']
insulin ['Up' 'No' 'Steady' 'Down']
glyburide-metformin ['No' 'Steady' 'Down' 'Up']
glipizide-metformin ['No' 'Steady']
glimepiride-pioglitazone ['No' 'Steady']
metformin-rosiglitazone ['No' 'Steady']
metformin-pioglitazone ['No' 'Steady']
In [316]:
# Assuming 'drug' is a list containing the drug names
for i in drug:
    print(i)
    sns.countplot(x=i, data=df)
    plt.show()
metformin
No description has been provided for this image
repaglinide
No description has been provided for this image
nateglinide
No description has been provided for this image
chlorpropamide
No description has been provided for this image
glimepiride
No description has been provided for this image
acetohexamide
No description has been provided for this image
glipizide
No description has been provided for this image
glyburide
No description has been provided for this image
tolbutamide
No description has been provided for this image
pioglitazone
No description has been provided for this image
rosiglitazone
No description has been provided for this image
acarbose
No description has been provided for this image
miglitol
No description has been provided for this image
troglitazone
No description has been provided for this image
tolazamide
No description has been provided for this image
examide
No description has been provided for this image
citoglipton
No description has been provided for this image
insulin
No description has been provided for this image
glyburide-metformin
No description has been provided for this image
glipizide-metformin
No description has been provided for this image
glimepiride-pioglitazone
No description has been provided for this image
metformin-rosiglitazone
No description has been provided for this image
metformin-pioglitazone
No description has been provided for this image

variable (drugs named citoglipton and examide), all records have the same value.So essentially these cannot provide any interpretive or discriminatory information for predicting readmission so we decided to drop these two variables

In [318]:
df['citoglipton'].value_counts()
Out[318]:
citoglipton
No    97805
Name: count, dtype: int64
In [319]:
df['examide'].value_counts()
Out[319]:
examide
No    97805
Name: count, dtype: int64
In [320]:
df = df.drop(['citoglipton','examide'],axis=1)
In [321]:
l1 = ['metformin','glipizide','insulin','glyburide']
In [322]:
for i in l1:
    print(i)
    sns.countplot(x=i, data=df)
    plt.show()
metformin
No description has been provided for this image
glipizide
No description has been provided for this image
insulin
No description has been provided for this image
glyburide
No description has been provided for this image
In [323]:
for i in l1:
    print(i)
    print((df[i].value_counts(normalize=True)*100))
metformin
metformin
No        80.004090
Steady    18.355912
Up         1.064363
Down       0.575635
Name: proportion, dtype: float64
glipizide
glipizide
No        87.316599
Steady    11.348091
Up         0.778079
Down       0.557231
Name: proportion, dtype: float64
insulin
insulin
No        46.818670
Steady    30.144676
Down      11.972803
Up        11.063852
Name: proportion, dtype: float64
glyburide
glyburide
No        89.369664
Steady     9.250038
Up         0.815909
Down       0.564388
Name: proportion, dtype: float64
In [324]:
for i in l1:
    df[i]=df[i].apply(lambda x:0 if (x== 'No' or x=='Steady') else 1)
In [325]:
df.insulin.value_counts()
Out[325]:
insulin
0    75274
1    22531
Name: count, dtype: int64
In [326]:
# drop rest of the medication columns
df = df.drop(['repaglinide',
 'nateglinide',
 'chlorpropamide',
 'glimepiride',
 'acetohexamide',
 'tolbutamide',
 'pioglitazone',
 'rosiglitazone',
 'acarbose',
 'miglitol',
 'troglitazone',
 'tolazamide',
 'glyburide-metformin',
 'glipizide-metformin',
 'glimepiride-pioglitazone',
 'metformin-rosiglitazone',
 'metformin-pioglitazone'],axis=1)
In [327]:
df.head()
Out[327]:
race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications ... max_glu_serum A1Cresult metformin glipizide glyburide insulin change diabetesMed readmitted service_utilization
1 Caucasian 0 15 Emergency Discharged to home Emergency 3 59 0 18 ... 0 0 0 0 0 1 1 1 0 0
2 AfricanAmerican 0 25 Emergency Discharged to home Emergency 2 11 5 13 ... 0 0 0 0 0 0 0 1 0 3
3 Caucasian 1 35 Emergency Discharged to home Emergency 2 44 1 16 ... 0 0 0 0 0 1 1 1 0 0
4 Caucasian 1 45 Emergency Discharged to home Emergency 1 51 0 8 ... 0 0 0 0 0 0 1 1 0 0
5 Caucasian 1 55 Emergency Discharged to home Referral 3 31 6 16 ... 0 0 0 0 0 0 0 1 0 0

5 rows × 24 columns

In [328]:
df.race.unique()
Out[328]:
array(['Caucasian', 'AfricanAmerican', 'Other', 'Asian', 'Hispanic'],
      dtype=object)
In [329]:
fig = plt.figure(figsize=(8,8))
sns.countplot(y=df['race'],hue=df['readmitted'])
Out[329]:
<Axes: xlabel='count', ylabel='race'>
No description has been provided for this image
In [330]:
race_dummies = pd.get_dummies(df['race'],drop_first=True)
race_dummies.head()
Out[330]:
Asian Caucasian Hispanic Other
1 False True False False
2 False False False False
3 False True False False
4 False True False False
5 False True False False
In [331]:
df = pd.concat([df,race_dummies],axis=1)
df.drop(['race'],inplace=True,axis=1)
df.head()
Out[331]:
gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications diag_1 ... glyburide insulin change diabetesMed readmitted service_utilization Asian Caucasian Hispanic Other
1 0 15 Emergency Discharged to home Emergency 3 59 0 18 276 ... 0 1 1 1 0 0 False True False False
2 0 25 Emergency Discharged to home Emergency 2 11 5 13 648 ... 0 0 0 1 0 3 False False False False
3 1 35 Emergency Discharged to home Emergency 2 44 1 16 8 ... 0 1 1 1 0 0 False True False False
4 1 45 Emergency Discharged to home Emergency 1 51 0 8 197 ... 0 0 1 1 0 0 False True False False
5 1 55 Emergency Discharged to home Referral 3 31 6 16 414 ... 0 0 0 1 0 0 False True False False

5 rows × 27 columns

In [332]:
fig = plt.figure(figsize=(8, 8))
sns.countplot(x='gender', hue='readmitted', data=df)
plt.title("Gender of patient vs readmission")
plt.show()
No description has been provided for this image
In [333]:
#Encoding
df['admission_type_id'].value_counts()
Out[333]:
admission_type_id
Emergency        69486
Elective         18344
Not Available     9975
Name: count, dtype: int64
In [334]:
df['admission_type_id'].unique()
Out[334]:
array(['Emergency', 'Elective', 'Not Available'], dtype=object)
In [335]:
df['admission_type_id']=df['admission_type_id'].replace({'Emergency':0,'Elective':1,'Not Available':2})

1.discharge_disposition_id

In [337]:
df['discharge_disposition_id'].value_counts()
Out[337]:
discharge_disposition_id
Discharged to home                            71928
Transferred to another facility               20574
Not Available                                  4598
Left AMA                                        604
Still patient/referred to this institution      101
Name: count, dtype: int64
In [338]:
df['discharge_disposition_id'].unique()
Out[338]:
array(['Discharged to home', 'Transferred to another facility',
       'Left AMA', 'Not Available',
       'Still patient/referred to this institution'], dtype=object)
In [339]:
df['discharge_disposition_id']=df['discharge_disposition_id'].replace({'Discharged to home':0,'Transferred to another facility':1,'Not Available':2,'Left AMA':3,'Still patient/referred to this institution':4})
In [340]:
df['discharge_disposition_id'].value_counts()
Out[340]:
discharge_disposition_id
0    71928
1    20574
2     4598
3      604
4      101
Name: count, dtype: int64

1.admission_source_id

In [342]:
df['admission_source_id'].value_counts()
Out[342]:
admission_source_id
Emergency                    55166
Referral                     29881
Not Available                 6743
Transferred from hospital     6015
Name: count, dtype: int64
In [343]:
df['admission_source_id']=df['admission_source_id'].replace({'Emergency':0,'Referral':1,'Not Available':2,'Transferred from hospital':3})
In [344]:
df['admission_source_id'].value_counts()
Out[344]:
admission_source_id
0    55166
1    29881
2     6743
3     6015
Name: count, dtype: int64
In [345]:
df.head()
Out[345]:
gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications diag_1 ... glyburide insulin change diabetesMed readmitted service_utilization Asian Caucasian Hispanic Other
1 0 15 0 0 0 3 59 0 18 276 ... 0 1 1 1 0 0 False True False False
2 0 25 0 0 0 2 11 5 13 648 ... 0 0 0 1 0 3 False False False False
3 1 35 0 0 0 2 44 1 16 8 ... 0 1 1 1 0 0 False True False False
4 1 45 0 0 0 1 51 0 8 197 ... 0 0 1 1 0 0 False True False False
5 1 55 0 0 1 3 31 6 16 414 ... 0 0 0 1 0 0 False True False False

5 rows × 27 columns

In [346]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 97805 entries, 1 to 101765
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   gender                    97805 non-null  int64 
 1   age                       97805 non-null  int64 
 2   admission_type_id         97805 non-null  int64 
 3   discharge_disposition_id  97805 non-null  int64 
 4   admission_source_id       97805 non-null  int64 
 5   time_in_hospital          97805 non-null  int64 
 6   num_lab_procedures        97805 non-null  int64 
 7   num_procedures            97805 non-null  int64 
 8   num_medications           97805 non-null  int64 
 9   diag_1                    97805 non-null  object
 10  diag_2                    97805 non-null  object
 11  diag_3                    97805 non-null  object
 12  number_diagnoses          97805 non-null  int64 
 13  max_glu_serum             97805 non-null  int64 
 14  A1Cresult                 97805 non-null  int64 
 15  metformin                 97805 non-null  int64 
 16  glipizide                 97805 non-null  int64 
 17  glyburide                 97805 non-null  int64 
 18  insulin                   97805 non-null  int64 
 19  change                    97805 non-null  int64 
 20  diabetesMed               97805 non-null  int64 
 21  readmitted                97805 non-null  int64 
 22  service_utilization       97805 non-null  int64 
 23  Asian                     97805 non-null  bool  
 24  Caucasian                 97805 non-null  bool  
 25  Hispanic                  97805 non-null  bool  
 26  Other                     97805 non-null  bool  
dtypes: bool(4), int64(20), object(3)
memory usage: 18.3+ MB
  1. diag_1,diag_2,diag_3
  • Diagnosis 1 (Nominal) The primary diagnosis(coded as first three digits of ICD9),848 distint values.
  • Diagnosis 2 (Nominal) The secondary diagnosis(coded as first three digits of ICD9),923 distint values.
  • Diagnosis 1 (Nominal) Additional secondary diagnosis(coded as first three digits of ICD9),954 distint values.
  • Diabetes:0
  • Circulatort:1
  • Respiratory:2
  • Digestive:3
  • Injury:4
  • Musculoskelet:5
  • Genitourinory:6
  • Neoplasms:7
  • Others:8
In [350]:
df.loc[df['diag_1'].str.contains('V',na=False),['diag_1']]=8        #others
df.loc[df['diag_1'].str.contains('E',na=False),['diag_1']]=8
df.loc[df['diag_2'].str.contains('V',na=False),['diag_2']]=8
df.loc[df['diag_2'].str.contains('E',na=False),['diag_2']]=8
df.loc[df['diag_3'].str.contains('V',na=False),['diag_3']]=8
df.loc[df['diag_3'].str.contains('E',na=False),['diag_3']]=8
In [351]:
df['diag_1']=df['diag_1'].astype('float')
df['diag_2']=df['diag_2'].astype('float')
df['diag_3']=df['diag_3'].astype('float')
  • Diagnosis:- stands for a primary diagnosis with possible values
  • Circulatory:- for ICD9:390-459,785,
  • Digestive:- for ICD9:520-579,787,
  • Genitourinary:- ICD9:580-629,788,
  • Diabetes:- ICD9:250,xx,
  • injury:- ICD9:800-999,
  • musculoskeletal:- ICD9:710-739
  • neoplasms:- ICD9:140-239
  • respiratory:- ICD9:460-519,786
  • other:- otherwise
In [353]:
for j in ['diag_1', 'diag_2', 'diag_3']:
    for i in range(len(df)):
        value = df.iloc[i][j]
        if np.floor(value) == 250:
            df.iloc[i, df.columns.get_loc(j)] = 0  # Diabetes
        elif (value >= 390 and value < 460) or np.floor(value) == 785:
            df.iloc[i, df.columns.get_loc(j)] = 1  # Circulatory
        elif (value >= 460 and value < 520) or np.floor(value) == 786:
            df.iloc[i, df.columns.get_loc(j)] = 2  # Respiratory
        elif (value >= 520 and value < 580) or np.floor(value) == 787:
            df.iloc[i, df.columns.get_loc(j)] = 3  # Digestive
        elif value >= 800 and value < 1000:
            df.iloc[i, df.columns.get_loc(j)] = 4  # Injury
        elif value >= 710 and value < 740:
            df.iloc[i, df.columns.get_loc(j)] = 5  # Musculoskeletal
        elif (value >= 580 and value < 630) or np.floor(value) == 788:
            df.iloc[i, df.columns.get_loc(j)] = 6  # Genitourinary
        elif value >= 140 and value < 240:
            df.iloc[i, df.columns.get_loc(j)] = 7  # Neoplasms
        else:
            df.iloc[i, df.columns.get_loc(j)] = 8  # Others
In [354]:
for i in ['diag_1','diag_2','diag_3']:
    print(df[i].unique())
[8. 7. 1. 0. 2. 4. 5. 3. 6.]
[0. 7. 1. 2. 8. 4. 5. 6. 3.]
[8. 1. 0. 2. 4. 7. 6. 5. 3.]
In [355]:
df['diag_1']=df['diag_1'].astype('int')
df['diag_2']=df['diag_2'].astype('int')
df['diag_3']=df['diag_3'].astype('int')
In [356]:
df.head()
Out[356]:
gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications diag_1 ... glyburide insulin change diabetesMed readmitted service_utilization Asian Caucasian Hispanic Other
1 0 15 0 0 0 3 59 0 18 8 ... 0 1 1 1 0 0 False True False False
2 0 25 0 0 0 2 11 5 13 8 ... 0 0 0 1 0 3 False False False False
3 1 35 0 0 0 2 44 1 16 8 ... 0 1 1 1 0 0 False True False False
4 1 45 0 0 0 1 51 0 8 7 ... 0 0 1 1 0 0 False True False False
5 1 55 0 0 1 3 31 6 16 1 ... 0 0 0 1 0 0 False True False False

5 rows × 27 columns

In [357]:
df.dtypes
Out[357]:
gender                      int64
age                         int64
admission_type_id           int64
discharge_disposition_id    int64
admission_source_id         int64
time_in_hospital            int64
num_lab_procedures          int64
num_procedures              int64
num_medications             int64
diag_1                      int32
diag_2                      int32
diag_3                      int32
number_diagnoses            int64
max_glu_serum               int64
A1Cresult                   int64
metformin                   int64
glipizide                   int64
glyburide                   int64
insulin                     int64
change                      int64
diabetesMed                 int64
readmitted                  int64
service_utilization         int64
Asian                        bool
Caucasian                    bool
Hispanic                     bool
Other                        bool
dtype: object

VISUALISATION¶

Univariate Analysis¶

gender¶

In [361]:
import plotly.express as px

# Assuming 'gender_counts' is already calculated
gender_counts = df['gender'].value_counts()

# Creating the pie chart in Plotly
fig = px.pie(values=gender_counts, names=gender_counts.index, title='Gender Distribution', 
             color_discrete_sequence=px.colors.sequential.RdBu, hole=0)

# Customizing the chart for better visualization
fig.update_traces(textposition='inside', textinfo='percent+label', 
                  marker=dict(line=dict(color='#000000', width=15)))

# Adjusting the layout
fig.update_layout(title_text='Gender Distribution', title_x=0.5)

# Showing the plot
fig.show()

age¶

In [363]:
age_counts = df['age'].value_counts()

# Plotting
plt.figure(figsize=(5, 3))  # Adjust the figure size if needed
age_counts.plot(kind='bar', color='skyblue')  # Create bar plot
plt.title('Distribution of Age')
plt.xlabel('age')
plt.ylabel('Count')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
No description has been provided for this image

num_lab_procedures¶

In [365]:
plt.figure(figsize=(5,3))
sns.distplot(df['num_lab_procedures'],color='red')
Out[365]:
<Axes: xlabel='num_lab_procedures', ylabel='Density'>
No description has been provided for this image

num_procedures¶

In [367]:
import plotly.express as px

# Plotting with Plotly
fig = px.histogram(
    df,
    x='num_procedures',
    color_discrete_sequence=['red'],  # Set the color to red
    nbins=50  # You can adjust the number of bins if needed
)

# Adjust the figure size
fig.update_layout(
    width=500,  # Width in pixels
    height=300,  # Height in pixels
    title='Distribution of num_procedures',
    xaxis_title='num_procedures',
    yaxis_title='Count'
)

fig.show()

readmitted¶

In [369]:
plt.figure(figsize=(5, 3))
df['readmitted'].value_counts().plot(kind='bar')
plt.show()
No description has been provided for this image

A1Cresult¶

In [371]:
plt.figure(figsize=(5,3))
sns.countplot(x='A1Cresult', data=df, color='red')
plt.show()
No description has been provided for this image

change¶

In [373]:
plt.figure(figsize=(5,3))
df['change'].value_counts().plot(kind='bar')
plt.show()
No description has been provided for this image
In [374]:
# Iterate through each element in 'drug'
for i in drug:
    # Check if 'i' is a valid column name in the DataFrame
    if i in df.columns:
        # Plot the count of each unique value in the column
        sns.countplot(x=df[i])  # Update this line to select the column correctly
        plt.title(f'Count of {i}')
        plt.xlabel(i)
        plt.ylabel('Count')
        plt.show()
    else:
        print(f"'{i}' is not a valid column name in the DataFrame.")
No description has been provided for this image
'repaglinide' is not a valid column name in the DataFrame.
'nateglinide' is not a valid column name in the DataFrame.
'chlorpropamide' is not a valid column name in the DataFrame.
'glimepiride' is not a valid column name in the DataFrame.
'acetohexamide' is not a valid column name in the DataFrame.
No description has been provided for this image
No description has been provided for this image
'tolbutamide' is not a valid column name in the DataFrame.
'pioglitazone' is not a valid column name in the DataFrame.
'rosiglitazone' is not a valid column name in the DataFrame.
'acarbose' is not a valid column name in the DataFrame.
'miglitol' is not a valid column name in the DataFrame.
'troglitazone' is not a valid column name in the DataFrame.
'tolazamide' is not a valid column name in the DataFrame.
'examide' is not a valid column name in the DataFrame.
'citoglipton' is not a valid column name in the DataFrame.
No description has been provided for this image
'glyburide-metformin' is not a valid column name in the DataFrame.
'glipizide-metformin' is not a valid column name in the DataFrame.
'glimepiride-pioglitazone' is not a valid column name in the DataFrame.
'metformin-rosiglitazone' is not a valid column name in the DataFrame.
'metformin-pioglitazone' is not a valid column name in the DataFrame.

DiabetesMed¶

In [376]:
import plotly.express as px

# Get the counts of each unique value in the 'diabetesMed' column
med_counts = df['diabetesMed'].value_counts()

# Plotting a pie chart using Plotly
fig = px.pie(values=med_counts, names=med_counts.index, 
             title='diabetesMed Distribution', 
             labels={'names':'diabetesMed'},
             hole=0)  # Adjust 'hole' to create a donut chart if desired

# Show the plot
fig.show()

readmitted¶

In [378]:
readmit_counts = df['readmitted'].value_counts()

# Plotting a pie chart
plt.figure(figsize=(5,3))
plt.pie(readmit_counts, labels=readmit_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Readmission Distribution')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
No description has been provided for this image

discharge_disposition_id¶

In [380]:
plt.figure(figsize=(5,3))
df['discharge_disposition_id'].value_counts().plot(kind='bar')
plt.show()
No description has been provided for this image

Bivariate Analysis¶

In [382]:
import seaborn as sns
In [383]:
pd.crosstab(df['age'],df['readmitted'])
#0=not readmitted,1=readmitted 
Out[383]:
readmitted 0 1
age
5 62 1
15 437 31
25 1289 217
35 3195 410
45 8380 1015
55 15188 1659
65 19380 2481
75 22114 3040
85 14278 2047
95 2276 305
In [384]:
pd.crosstab(df['diabetesMed'],df['readmitted'])
Out[384]:
readmitted 0 1
diabetesMed
0 20086 2222
1 66513 8984
In [385]:
pd.crosstab(df['A1Cresult'],df['diabetesMed'])
Out[385]:
diabetesMed 0 1
A1Cresult
0 19599 61794
1 1225 3642
2 1484 10061
In [386]:
pd.crosstab(df['A1Cresult'],df['readmitted'])
Out[386]:
readmitted 0 1
A1Cresult
0 71839 9554
1 4388 479
2 10372 1173
In [387]:
pd.crosstab(df['A1Cresult'],df['change'])
Out[387]:
change 0 1
A1Cresult
0 45169 36224
1 2716 2151
2 4461 7084
In [388]:
pd.crosstab(df['max_glu_serum'],df['readmitted'])
Out[388]:
readmitted 0 1
max_glu_serum
0 82190 10551
1 2212 292
2 2197 363
In [389]:
pd.crosstab(df['max_glu_serum'],df['A1Cresult'])
Out[389]:
A1Cresult 0 1 2
max_glu_serum
0 76619 4810 11312
1 2407 38 59
2 2367 19 174
In [390]:
pd.crosstab(df['insulin'],df['readmitted'])
Out[390]:
readmitted 0 1
insulin
0 67185 8089
1 19414 3117
  • Insulin vs Readmission
In [392]:
# Plotting
plt.figure(figsize=(5,3))  # Adjust the figure size if needed
sns.countplot(data=df, x='insulin', hue='readmitted')
plt.title('Readmission vs Insulin')
plt.xlabel('Insulin')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right')  # Add legend for readmission
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
No description has been provided for this image
  • change vs readmitted
In [394]:
plt.figure(figsize=(5,3))
sns.countplot(x='change', hue='readmitted', data=df)
plt.xlabel('Change')
plt.ylabel('Count')
plt.title('Count of Readmissions by Change')
plt.show()
No description has been provided for this image
  • diabetesMed vs readmitted
In [396]:
plt.figure(figsize=(5,3))  

sns.countplot(x='diabetesMed', hue='readmitted', data=df)

plt.xlabel('Diabetes Medication')
plt.ylabel('Count')
plt.title('Count of Readmissions by Diabetes Medication')
plt.legend(title='Readmitted', loc='upper right')

plt.show()
No description has been provided for this image
  • Time_in_hospital vs readmitted
In [398]:
import plotly.express as px

# Plotting a count plot using Plotly
fig = px.histogram(df, 
                   x='time_in_hospital', 
                   color='readmitted', 
                   title='Readmission vs Time in Hospital',
                   labels={'time_in_hospital':'Time in Hospital', 'count':'Count'},
                   barmode='group',  # Ensures that bars for different categories are shown side by side
                   category_orders={"readmitted": ["No", "Yes"]}  # Optional: specify order if needed
                  )

# Customize layout to adjust the size and position of elements
fig.update_layout(
    width=500,  # Width in pixels
    height=300,  # Height in pixels
    legend_title_text='Readmission',  # Title for the legend
    legend=dict(
        x=1,  # Position legend on the right
        y=1,  # Position legend at the top
        traceorder="normal"
    )
)

# Show the plot
fig.show()
  • max_glu_serum vs readmission
In [400]:
# Plotting
plt.figure(figsize=(5,3))  # Adjust the figure size if needed
sns.countplot(data=df, x='max_glu_serum', hue='readmitted')
plt.title('Readmission vs max_glu_serum')
plt.xlabel('max_glu_serum')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right')  # Add legend for readmission
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
No description has been provided for this image
  • A1Cresult vs readmission
In [402]:
# Plotting
plt.figure(figsize=(5,3))  # Adjust the figure size if needed
sns.countplot(data=df, x='A1Cresult', hue='readmitted')
plt.title('Readmission vs A1Cresult')
plt.xlabel('A1Cresult')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right')  # Add legend for readmission
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
No description has been provided for this image
  • gender vs readmission
In [404]:
# Plotting
plt.figure(figsize=(5,3))  # Adjust the figure size if needed
sns.countplot(data=df, x='gender', hue='readmitted')
plt.title('Readmission vs gender')
plt.xlabel('gender')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right')  # Add legend for readmission
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
No description has been provided for this image
  • age vs readmission
In [406]:
import plotly.express as px

# Plotting a count plot using Plotly
fig = px.histogram(df, 
                   x='age', 
                   color='readmitted', 
                   title='Readmission vs Age',
                   labels={'age':'Age', 'count':'Count'},
                   barmode='group'  # Ensures that bars for different categories are shown side by side
                  )

# Customize layout to adjust the size and position of elements
fig.update_layout(
    width=500,  # Width in pixels
    height=300,  # Height in pixels
    legend_title_text='Readmission',  # Title for the legend
    legend=dict(
        x=1,  # Position legend on the right
        y=1,  # Position legend at the top
        traceorder="normal"
    )
)

# Show the plot
fig.show()
  • metformin vs readmission
In [408]:
# Plotting
plt.figure(figsize=(5,3))  # Adjust the figure size if needed
sns.countplot(data=df, x='metformin', hue='readmitted')
plt.title('Readmission vs metformin')
plt.xlabel('metformin')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right')  # Add legend for readmission
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
No description has been provided for this image
  • glipizide vs readmission
In [410]:
# Plotting
plt.figure(figsize=(5,3))  # Adjust the figure size if needed
sns.countplot(data=df, x='glipizide', hue='readmitted')
plt.title('Readmission vs glipizide')
plt.xlabel('glipizide')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right')  # Add legend for readmission
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
No description has been provided for this image
  • glyburide vs readmission
In [412]:
# Plotting
plt.figure(figsize=(5,3))  # Adjust the figure size if needed
sns.countplot(data=df, x='glyburide', hue='readmitted')
plt.title('Readmission vs glyburide')
plt.xlabel('glyburide')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right')  # Add legend for readmission
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
No description has been provided for this image
  • num_medications vs readmission
In [414]:
import plotly.express as px

# Plotting a count plot using Plotly
fig = px.histogram(df, 
                   x='num_medications', 
                   color='readmitted', 
                   title='Readmission vs num_medications',
                   labels={'num_medications':'Number of Medications', 'count':'Count'},
                   barmode='group'  # Ensures that bars for different categories are shown side by side
                  )

# Customize layout to adjust the size and position of elements
fig.update_layout(
    width=800,  # Width in pixels
    height=500,  # Height in pixels
    legend_title_text='Readmission',  # Title for the legend
    legend=dict(
        x=1,  # Position legend on the right
        y=1,  # Position legend at the top
        traceorder="normal"
    )
)

# Show the plot
fig.show()
  • num_lab_procedures vs readmission
In [416]:
import plotly.express as px

# Plotting a grouped bar chart using Plotly
fig = px.histogram(df, 
                   x='num_lab_procedures', 
                   color='readmitted', 
                   title='Readmission vs num_lab_procedures',
                   labels={'num_lab_procedures':'Number of Lab Procedures', 'count':'Count'},
                   barmode='group'  # Ensures that bars for different categories are shown side by side
                  )

# Customize layout to adjust the size and position of elements
fig.update_layout(
    width=1000,  # Width in pixels
    height=800,  # Height in pixels
    legend_title_text='Readmission',  # Title for the legend
    legend=dict(
        x=1,  # Position legend on the right
        y=1,  # Position legend at the top
        traceorder="normal"
    )
)

# Show the plot
fig.show()
  • num_procedures
In [418]:
# Plotting
plt.figure(figsize=(5,3))  # Adjust the figure size if needed
sns.countplot(data=df, x='num_procedures', hue='readmitted')
plt.title('Readmission vs num_procedures')
plt.xlabel('num_procedures')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right')  # Add legend for readmission
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
No description has been provided for this image
  • time_in_hospital vs readmission
In [420]:
# Plotting
plt.figure(figsize=(8, 6))  # Adjust the figure size if needed
sns.countplot(data=df, x='time_in_hospital', hue='readmitted')
plt.title('Readmission vs time_in_hospital')
plt.xlabel('time_in_hospital')
plt.ylabel('Count')
plt.legend(title='Readmission', loc='upper right')  # Add legend for readmission
plt.tight_layout()  # Adjust layout to prevent clipping of labels
plt.show()
No description has been provided for this image
In [421]:
df.dtypes
Out[421]:
gender                      int64
age                         int64
admission_type_id           int64
discharge_disposition_id    int64
admission_source_id         int64
time_in_hospital            int64
num_lab_procedures          int64
num_procedures              int64
num_medications             int64
diag_1                      int32
diag_2                      int32
diag_3                      int32
number_diagnoses            int64
max_glu_serum               int64
A1Cresult                   int64
metformin                   int64
glipizide                   int64
glyburide                   int64
insulin                     int64
change                      int64
diabetesMed                 int64
readmitted                  int64
service_utilization         int64
Asian                        bool
Caucasian                    bool
Hispanic                     bool
Other                        bool
dtype: object

Relation with Target variable¶

In [423]:
import pandas as pd
import numpy as np

print("According to the data there are total 50 columns")

# Select categorical columns
df_cat = df.select_dtypes(include='object')

# Select numerical columns
df_num = df.select_dtypes(include=np.number)

print('Total categorical columns : ', len(df_cat.columns))
print('\n')
print('Categorical column names : ', df_cat.columns)
print('\n')
print('Total numerical columns : ', len(df_num.columns))
print('\n')
print('Numerical column names : ', df_num.columns)
According to the data there are total 50 columns
Total categorical columns :  0


Categorical column names :  Index([], dtype='object')


Total numerical columns :  23


Numerical column names :  Index(['gender', 'age', 'admission_type_id', 'discharge_disposition_id',
       'admission_source_id', 'time_in_hospital', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'diag_1', 'diag_2', 'diag_3',
       'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin',
       'glipizide', 'glyburide', 'insulin', 'change', 'diabetesMed',
       'readmitted', 'service_utilization'],
      dtype='object')
In [424]:
# Calculate correlation matrix
correlation_matrix = df.corr()

# Calculate correlation of each feature with the target variable
correlation_with_target = correlation_matrix['readmitted'].sort_values(ascending=False)

# Print correlation of each feature with the target variable
print("Correlation with target variable 'readmitted':")
print(correlation_with_target)

# Visualize correlation matrix as a heatmap
plt.figure(figsize=(22, 18))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
Correlation with target variable 'readmitted':
readmitted                  1.000000
service_utilization         0.126385
discharge_disposition_id    0.064165
number_diagnoses            0.050132
time_in_hospital            0.045807
insulin                     0.040826
num_medications             0.039301
diabetesMed                 0.025548
num_lab_procedures          0.023401
age                         0.018844
change                      0.018185
max_glu_serum               0.013159
diag_3                      0.009672
glipizide                   0.009611
diag_2                      0.008928
Caucasian                   0.002037
gender                     -0.002070
Asian                      -0.002084
Hispanic                   -0.004049
glyburide                  -0.005139
diag_1                     -0.005402
Other                      -0.006676
metformin                  -0.007274
num_procedures             -0.011721
admission_type_id          -0.012577
A1Cresult                  -0.018297
admission_source_id        -0.019333
Name: readmitted, dtype: float64
No description has been provided for this image

Service Utilization (0.126385): There is a moderate positive correlation between service utilization and readmission. Patients who utilize more services during their stay are slightly more likely to be readmitted.

Discharge Disposition ID (0.064165): There is a weak positive correlation between the discharge disposition ID and readmission. Certain discharge dispositions may be associated with higher chances of readmission.

Number of Diagnoses (0.050132): There is a weak positive correlation between the number of diagnoses and readmission. Patients with a higher number of diagnoses tend to have slightly higher readmission rates.

Time in Hospital (0.045807): There is a weak positive correlation between the time spent in the hospital and readmission. Longer hospital stays may be associated with slightly higher readmission rates.

Insulin (0.040826): There is a weak positive correlation between insulin usage and readmission. Patients prescribed insulin may have slightly higher chances of readmission.

Number of Medications (0.039301): There is a weak positive correlation between the number of medications prescribed and readmission. Patients on more medications may have slightly higher readmission rates.

Diabetes Medication (0.025548): There is a weak positive correlation between diabetes medication and readmission. Patients prescribed diabetes medication may have slightly higher chances of readmission.

Number of Lab Procedures (0.023401): There is a weak positive correlation between the number of lab procedures performed and readmission. Patients undergoing more lab procedures may have slightly higher readmission rates.

Age (0.018844): There is a weak positive correlation between age and readmission. Older patients may have slightly higher chances of readmission.

Change (0.018185): There is a weak positive correlation between change in medication and readmission. Changes in medication regimen may be associated with slightly higher readmission rates.

Model building¶

In [427]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
In [428]:
df.head()
Out[428]:
gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications diag_1 ... glyburide insulin change diabetesMed readmitted service_utilization Asian Caucasian Hispanic Other
1 0 15 0 0 0 3 59 0 18 8 ... 0 1 1 1 0 0 False True False False
2 0 25 0 0 0 2 11 5 13 8 ... 0 0 0 1 0 3 False False False False
3 1 35 0 0 0 2 44 1 16 8 ... 0 1 1 1 0 0 False True False False
4 1 45 0 0 0 1 51 0 8 7 ... 0 0 1 1 0 0 False True False False
5 1 55 0 0 1 3 31 6 16 1 ... 0 0 0 1 0 0 False True False False

5 rows × 27 columns

In [429]:
X=df.drop('readmitted',axis=1)
y=df.readmitted
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=1)
print('X_train',X_train.shape)
print('y_train',y_train.shape)
print('X_test',X_test.shape)
print('y_test',y_test.shape)
X_train (68463, 26)
y_train (68463,)
X_test (29342, 26)
y_test (29342,)

LogisticRegression¶

In [431]:
# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)
Accuracy: 0.8839888214845614
Confusion Matrix:
[[25912    42]
 [ 3362    26]]
Classification Report:
              precision    recall  f1-score   support

           0       0.89      1.00      0.94     25954
           1       0.38      0.01      0.02      3388

    accuracy                           0.88     29342
   macro avg       0.63      0.50      0.48     29342
weighted avg       0.83      0.88      0.83     29342

DecisionTree¶

In [433]:
# Initialize and train the Logistic Regression model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)
Accuracy: 0.7860745688773771
Confusion Matrix:
[[22456  3498]
 [ 2779   609]]
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.87      0.88     25954
           1       0.15      0.18      0.16      3388

    accuracy                           0.79     29342
   macro avg       0.52      0.52      0.52     29342
weighted avg       0.80      0.79      0.79     29342

RandomForest¶

In [435]:
# Initialize and train the Logistic Regression model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)
Accuracy: 0.8846704382795992
Confusion Matrix:
[[25947     7]
 [ 3377    11]]
Classification Report:
              precision    recall  f1-score   support

           0       0.88      1.00      0.94     25954
           1       0.61      0.00      0.01      3388

    accuracy                           0.88     29342
   macro avg       0.75      0.50      0.47     29342
weighted avg       0.85      0.88      0.83     29342

Findings¶

If the patient has the following characteristics he has a high probability of being readmitted

  • High preceding year visit
  • If the patient is discharge to another medical facility or discharged to home with health services
  • High number of diagnoses
  • If the patient is given diabetes medicines
  • If the primary diagnosed disease was of circulatory system
  • If metformin and/or insulin is not being given or the dosage is low
  • If secondary disgnosis was coming to be Diabetes
  • If A1C test was not performed

CHALLENGES ENCOUNTERED¶

  • The major challenge was to acquire sufficient domain knowledge of the medical world
  • Huge amount of Data

Business Ideas¶

  • Medical facilities can take precautionary measures with patients during their initial admission by making A1C and Maximum Glucose Serum test compulsory and providing the treatment accordingly

  • Providing extra attention and care to high-risk patients

  • A follow-up with the discharged patients should be done to keep a track of their health and to counsel them from time-to-time

  • High-risk patients current medicines regime should be re-evaluated and the most effective medicines should be considered

  • Most effective Medications:- Metformin,Glipizide,Insulin

  • The annual plan,financials and infrastructure/inventory should be planned accordingly by taking into account the predicted readmissions.

Conclusions¶

1.Patients with higher service utilization, longer time in hospital, and higher number of medications are more likely to be readmitted.

2.Medications such as Metformin, Glipizide, and Insulin show significant association with reduced readmissions, highlighting their importance in diabetes management.

3.Demographic factors such as age, gender, and race do not have a strong correlation with readmissions, indicating that clinical factors play a more significant role.

4.Predictive models, such as logistic regression and decision trees, can accurately predict hospital readmissions for diabetic patients based on clinical and demographic variables.

5.Feature engineering techniques, such as combining service utilization variables, can improve the predictive performance of models.

6.Preserving outliers in the dataset is crucial as they often represent extreme cases with significant predictive power.

7.Addressing class imbalance in the target variable improves the generalization of predictive models and reduces bias.

8.Model regularization techniques help prevent overfitting and improve the robustness of predictive models.

9.Annual planning and resource allocation in healthcare facilities should consider predicted readmissions to optimize care delivery and reduce costs.

10.Implementing targeted interventions, such as compulsory A1C and Maximum Glucose Serum tests for high-risk patients, can help prevent readmissions and improve patient outcomes.

  • These conclusions and business ideas can serve as valuable insights for healthcare providers, policymakers, and stakeholders looking to address the challenge of hospital readmissions for diabetic patients and improve overall healthcare outcomes.

END¶